Abstract

This paper explores the homogeneity of coefficients in high-dimensional regression, 
which extends the sparsity concept and is more general and suitable for many appli- 
cations. Homogeneity arises when one expects regression coefficients corresponding 
to neighboring geographical regions or a similar cluster of covariates to be approxi- 
mately the same. Sparsity corresponds to a special case of homogeneity with a known 
atom zero. In this article, we propose a new method called clustering algorithm 
in regression via data-driven segmentation (CARDS) to explore homogeneity. New 
mathematics are provided on the gain that can be achieved by exploring homogene- 
ity. Statistical properties of two versions of CARDS are analyzed. In particular, the 
asymptotic normality of our proposed CARDS estimator is established, which reveals 
better estimation accuracy for homogeneous parameters than that without homogene- 
ity exploration. When our methods are combined with sparsity exploration, further 
efficiency can be achieved beyond the exploration of sparsity alone. This provides 
additional insights into the power of exploring low-dimensional structure in
high-dimensional regression: homogeneity and sparsity. The newly developed method is
further illustrated by simulation studies and applications to real data. 
1 Introduction



Driven by applications in genetics, image processing, etc., high dimensionality has become 
one of the major themes in statistics. To overcome the difficulty of fitting high dimen- 
sional models, one usually assumes that the true parameters lie in a low dimensional 
subspace. For example, many papers focus on sparsity, i.e., only a small fraction of coef- 
ficients are nonzero. In this article, we consider a more general type of low dimensional 
structure: homogeneity, i.e., the coefficients share only a few common clusters of values. 
A motivating example is the gene network analysis, where it is assumed that genes cluster 
into groups which play similar functions in molecular processes. It can be modeled as a 
linear regression problem with groups of homogeneous coefficients. Similarly, in diagnos- 
tic lab tests, one often counts the number of positive results in a battery of medical tests, 
which implicitly assumes that their regression coefficients (impact) in the joint models are 
approximately the same. In spatial-temporal studies, it is not unreasonable to assume 
the dynamics of neighboring geographical regions are similar, namely, their regression 
coefficients are clustered. In the same vein, financial returns of similar sectors of industry 
share similar loadings on risk factors. 

Homogeneity is a more general assumption than sparsity, where the latter can be 
viewed as a special case of the former with a large group of 0-value coefficients. In 
addition, the atom is known to data analysts. One advantage of assuming homogeneity 
rather than sparsity is that it enables us to select more than n variables (n is the sample 
size). Moreover, identifying the homogeneous groups naturally provides a structure in the 
covariates, which can be helpful in scientific discoveries. 

Regression under the homogeneity setting has been studied in a few papers. First
of all, the fused lasso [Tibshirani et al., 2005; Friedman et al., 2007] can be regarded as an
effort to explore homogeneity, with the assistance of neighborhoods defined according
to either time or location. The difference in our study is that we do not assume such a
neighborhood to be known a priori. The clustering of homogeneous coefficients is
completely data-driven. For example, in the fused lasso, given a complete ordering of
the covariates, Tibshirani et al. [2005] add $L_1$ penalties to the pairs of adjacent coordinates;
in the case without a complete ordering, they suggest penalizing the pairs of 'neighboring'
nodes in the sense of a general distance measure. Bondell and Reich [2008] propose the
method OSCAR, where a special octagonal shrinkage penalty is applied to each pair of
coordinates to promote equal-value solutions. Shen and Huang [2010] develop an algorithm
called Grouping Pursuit, where they add truncated $L_1$ penalties to the pairwise differences
for all pairs of coordinates. However, these methods depend either on a known ordering
of the covariates, which is usually not available, or on exhaustive pairwise penalties, which
may increase the computational complexity when the dimension $p$ is large.

In this article, we propose a new method called Clustering Algorithm in Regression via 
Data-driven Segmentation (CARDS) to explore homogeneity. The main idea of CARDS is 
to take advantage of available estimates without homogeneity structure and shrink those
coefficients that are estimated to be "close" further towards homogeneity. In the basic version
of CARDS, it first builds an ordering of covariates from a preliminary estimate, then runs
a penalized least squares with fused penalties in the new ordering. The number of penalty
terms is only $p-1$, compared to $p(p-1)/2$ in the exhaustive pairwise penalties. In an
advanced version of CARDS, it builds an "ordered segmentation" on the covariates, which
can be viewed as a generalized ordering, and imposes so-called "hybrid pairwise penalties",
which can be viewed as a generalization of fused penalties. This version of CARDS
is more tolerant of possible misorderings in the preliminary estimate. Compared with
other methods for homogeneity, CARDS can successfully deal with the case of unordered 
covariates. At the same time, it avoids using exhaustive pairwise penalties and can be 
computationally more efficient than the Grouping Pursuit and OSCAR. 

We also provide theoretical analysis of CARDS. It reveals that the sum of squared
errors of the estimated coefficients is $O_p(K/n)$, where $K$ is the number of true homogeneous
groups. Therefore, the smaller the number of true groups is, the better precision it can
achieve. In particular, when $K = p$, there is no homogeneity to explore and the result
reduces to the case without grouping. Moreover, in order to exactly recover the true groups
with high probability, the minimum signal strength (the gaps between different groups)
is of the order $\max_k\{\sqrt{|A_k|\log(p)/n}\}$, where the $|A_k|$'s are the sizes of the true groups. In addition,
the asymptotic normality of our proposed CARDS estimator is established, which reveals
better estimation accuracy than that without homogeneity exploration. Furthermore, our
results can be combined with the sparsity results to provide additional insights on
the power of the low-dimensional structure in high-dimensional regression: homogeneity
and sparsity. Our analysis of the basic version of CARDS also establishes a framework
for analyzing the fused type of penalties, which is, to our knowledge, new to the literature.
Throughout this paper, we consider the following linear regression setting

$$y = X\beta^0 + \varepsilon, \tag{1}$$

where $X = (x_1, \dots, x_p)$ is an $n \times p$ design matrix, $y = (y_1, \dots, y_n)^T$ is an $n \times 1$
vector of responses, $\beta^0 = (\beta^0_1, \dots, \beta^0_p)^T$ denotes the true parameters of interest, and
$\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$ with the $\varepsilon_i$'s being independent and identically distributed noises with
$E(\varepsilon_i) = 0$ and $E(\varepsilon_i^2) = \sigma^2$. We assume further that there is a partition of $\{1, 2, \dots, p\}$,
denoted as $\mathcal{A} = (A_0, A_1, \dots, A_K)$, such that

$$\beta^0_i = \beta^0_{A,k} \quad \text{for all } i \in A_k, \tag{2}$$

where $\beta^0_{A,k}$ is the common value shared by all indices in $A_k$. By default, $\beta^0_{A,0} = 0$, so $A_0$
is the group of 0-value coefficients. This allows us to explore homogeneity and sparsity
simultaneously. Write $\beta^0_A = (\beta^0_{A,1}, \dots, \beta^0_{A,K})^T$. Without loss of generality, we assume
$\beta^0_{A,1} < \beta^0_{A,2} < \cdots < \beta^0_{A,K}$.

Our theory and methods are stated for the standard least-squares problem, although
they can be adapted to other more sophisticated models. For example, when forecasting
housing appreciation in the United States [Fan et al., 2011], one builds the spatial-temporal
model

$$Y_{it} = X_{it}^T\beta_i + \varepsilon_{it}, \tag{3}$$

in which $i$ indicates a spatial location and $t$ indicates time. It is expected that the $\beta_i$'s are
approximately the same for neighboring zip codes $i$, and this type of homogeneity can be
explored in a similar fashion. Similarly, when $Y_{it}$ represents the returns of a stock and
$X_{it}$ stands for risk factors, one can assume a certain degree of homogeneity within a
sector of industry; namely, the factor loading vector $\beta_i$ is approximately the same.

Throughout this paper, $\mathbb{R}$ denotes the set of real numbers, and for a positive integer
$p$, $\mathbb{R}^p$ denotes the $p$-dimensional real Euclidean space. For any positive sequences $\{a_n\}$
and $\{b_n\}$, we write $a_n \gg b_n$ if $a_n/b_n$ tends to infinity as $n$ increases to infinity. Given
$1 \le q < \infty$, for any vector $x$, $\|x\|_q = (\sum_j |x_j|^q)^{1/q}$ denotes the $L_q$-norm of $x$. In
particular, $\|x\|_\infty = \max_j\{|x_j|\}$. For any matrix $M$, $\|M\|_q = \max_{x: \|x\|_q = 1}\|Mx\|_q$ denotes
the matrix $L_q$-norm of $M$. In particular, $\|M\|_\infty$ is the maximum absolute row sum of
$M$. We omit the subscript $q$ when $q = 2$. $\|M\|_{\max} = \max_{i,j}\{|M_{ij}|\}$ denotes the matrix
max norm. When $M$ is symmetric, $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ denote the maximum and
minimum eigenvalues of $M$, respectively.



The rest of the paper is organized as follows. Section 2 describes CARDS, including 
the basic and advanced versions. Section 3 states theoretical properties of the basic 
version of CARDS, and Section 4 analyzes the advanced version. Sections 5 and 6 present 
the results of simulation studies and real data analysis. Section 7 contains concluding 
remarks. Proofs can be found in Section 8. 



2 CARDS: a data-driven pairwise shrinkage procedure 
2.1 Basic version of CARDS 

Without considering the homogeneity assumption (2), there are many methods available
for fitting model (1). Let $\tilde\beta$ be such a preliminary estimator. A very simple idea to
generate homogeneity is as follows: first, rearrange the coefficients in $\tilde\beta$ in ascending
order; second, group together those adjacent indices whose coefficients in $\tilde\beta$ are close;
finally, force indices in each estimated group to share a common coefficient and refit model
(1). A main problem of this naive procedure is how to group the indices. Alternatively,
we can run a penalized least squares to simultaneously extract the grouping structure and
estimate coefficients. To shrink coefficients of adjacent indices (after reordering) towards
homogeneity, we can add fused penalties, i.e., $\{|\beta_{i+1} - \beta_i|, i = 1, \dots, p-1\}$ are penalized.
This leads to the following two-stage procedure:

• Preordering: Construct the rank statistics $\{\tau(j) : 1 \le j \le p\}$ such that $\tilde\beta_{\tau(j)}$ is
the $j$-th smallest value in $\{\tilde\beta_i : 1 \le i \le p\}$, i.e.,

$$\tilde\beta_{\tau(1)} \le \tilde\beta_{\tau(2)} \le \cdots \le \tilde\beta_{\tau(p)}. \tag{4}$$

• Estimation: Given a folded concave penalty function $p_\lambda(\cdot)$ [Fan and Li, 2001] with
a regularization parameter $\lambda$, let

$$\hat\beta = \arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p-1} p_\lambda\big(|\beta_{\tau(j+1)} - \beta_{\tau(j)}|\big)\Big\}. \tag{5}$$

We call this two-stage procedure the basic version of CARDS (bCARDS). In the first
stage, it establishes a data-driven rank mapping $\tau(\cdot)$ from the preliminary estimator $\tilde\beta$.
In the second stage, only "adjacent" coefficient pairs under the order $\tau$ are penalized,
resulting in only $p-1$ penalty terms in total. In addition, note that (5) does not require
that $\hat\beta_{\tau(j)} \le \hat\beta_{\tau(j+1)}$. This allows coordinates in $\hat\beta$ to have a different order of increasing
values from that in $\tilde\beta$.

With an appropriately large tuning parameter $\lambda$, $\hat\beta$ is a piecewise constant vector in
the order of $\tau(\cdot)$ and consequently its elements have homogeneous groups. In Section 3,
we shall show that, if $\tau$ is from a rank consistent estimate of $\beta^0$, namely

$$\beta^0_{\tau(1)} \le \beta^0_{\tau(2)} \le \cdots \le \beta^0_{\tau(p)}, \tag{6}$$

then under some regularity conditions, $\hat\beta$ can consistently estimate the true coefficient
groups of $\beta^0$ with high probability.

When $p_\lambda(\cdot)$ is a folded-concave penalty function (e.g., SCAD, MCP), (5) is a
non-convex optimization problem. It is generally difficult to compute the global minimum.
The local linear approximation (LLA) algorithm can be applied to produce a certain
local minimum for any fixed initial solution; see Zou and Li [2008], Fan et al. [2012] and
references therein for details.
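
As a rough illustration of the two stages of bCARDS, the following Python sketch computes the rank mapping $\tau$ in (4) and evaluates the objective in (5) for a candidate coefficient vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `mcp_penalty`, `preorder` and `bcards_objective` are hypothetical, MCP with $a = 3$ is chosen only as one example of a folded concave penalty, indices are 0-based, and no optimizer is included.

```python
import numpy as np

def mcp_penalty(t, lam, a=3.0):
    """MCP (one folded concave penalty): lam*|t| - t^2/(2a) if |t| <= a*lam, else a*lam^2/2."""
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t**2 / (2.0 * a), 0.5 * a * lam**2)

def preorder(beta_tilde):
    """Rank mapping tau as in (4): tau[j] is the index of the (j+1)-th smallest entry."""
    return np.argsort(beta_tilde, kind="stable")

def bcards_objective(beta, X, y, tau, lam, a=3.0):
    """Objective (5): least-squares loss plus p-1 fused penalties along the order tau."""
    n = X.shape[0]
    loss = np.sum((y - X @ beta) ** 2) / (2.0 * n)
    gaps = np.diff(beta[tau])  # beta_{tau(j+1)} - beta_{tau(j)}, j = 1, ..., p-1
    return loss + np.sum(mcp_penalty(gaps, lam, a))

# Hypothetical preliminary estimate with two apparent clusters of values.
beta_tilde = np.array([2.1, -0.9, 1.9, -1.1])
tau = preorder(beta_tilde)
```

Only the $p-1$ differences between neighbors in the sorted order enter the penalty, which is the source of the computational saving over exhaustive pairwise penalties.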



2.2 Advanced version of CARDS 

To guarantee the success of CARDS, (6) is an essential condition. To be more specific, (6)
requires that within each true group $A_k$, the order of the coordinates can be arbitrarily
shuffled, but for $(i, j)$ belonging to different true groups, if $\beta^0_i < \beta^0_j$, $\tau(i) < \tau(j)$ must
hold. This imposes fairly strong conditions on the preliminary estimator $\tilde\beta$. For example,
(6) can be easily violated if $\|\tilde\beta - \beta^0\|_\infty$ is larger than the minimum gap between groups.
To relax such a restrictive requirement, we now introduce an advanced version of CARDS,
where the main idea is to use less information from $\tilde\beta$ and to add more penalty terms in
(5).

We first introduce the ordered segmentation, which can be viewed as a generalized
ordering. It is similar to letter grades assigned to a class.

Definition 2.1. For a positive integer $L$, the mapping $\Upsilon : \{1, \dots, p\} \to \{1, \dots, L\}$ is
called an ordered segmentation if the sets $B_l = \{1 \le j \le p : \Upsilon(j) = l\}$, $1 \le l \le L$, form a
partition of $\{1, \dots, p\}$.

Each set $B_l$ is called a segment. When $L = p$, $\Upsilon$ is a one-to-one mapping and it defines
a complete ordering. When $L < p$, only the segments $\{B_1, \dots, B_L\}$ are ordered, but the
order of coordinates within each segment is not defined.




In the basic version of CARDS, the preliminary estimator $\tilde\beta$ produces a complete rank
mapping $\tau$. Now in the advanced version of CARDS, instead of extracting a complete
ordering, we only extract an ordered segmentation $\Upsilon$ from $\tilde\beta$. The analogy is
grading an exam: overall score rank (percentile rank) versus letter grade. Let $\delta > 0$ be a
predetermined parameter. First, obtain the rank mapping $\tau$ as in (4) and find all indices
$i_2 < i_3 < \cdots < i_L$ such that the gaps

$$\tilde\beta_{\tau(j)} - \tilde\beta_{\tau(j-1)} > \delta, \qquad j = i_2, \dots, i_L.$$

Then, construct the segments

$$B_l = \{\tau(i_l), \tau(i_l + 1), \dots, \tau(i_{l+1} - 1)\}, \qquad l = 1, \dots, L, \tag{7}$$

where $i_1 = 1$ and $i_{L+1} = p + 1$. This process is indeed similar to the letter grade that we
assign. The intuition behind this construction is that when $\tilde\beta_{\tau(k+1)} \le \tilde\beta_{\tau(k)} + \delta$, i.e., the
estimated coefficients of two "adjacent coordinates" differ by only a small amount, we do
not trust the ordering between them and group them into the same segment. Compared to
the complete ordering $\tau$, the ordered segments $\{B_1, \dots, B_L\}$ utilize less information from
$\tilde\beta$.
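
The construction in (7) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `ordered_segments` and the numerical values are hypothetical, and indices are 0-based rather than 1-based as in the text.

```python
import numpy as np

def ordered_segments(beta_tilde, delta):
    """Build the ordered segments B_1, ..., B_L of (7) from a preliminary estimate.

    A new segment starts wherever the gap between consecutive sorted
    preliminary coefficients exceeds delta.
    """
    tau = np.argsort(beta_tilde, kind="stable")   # rank mapping as in (4)
    sorted_vals = beta_tilde[tau]
    # Positions j (in sorted order) with sorted_vals[j] - sorted_vals[j-1] > delta.
    breaks = np.flatnonzero(np.diff(sorted_vals) > delta) + 1
    return [seg.tolist() for seg in np.split(tau, breaks)]

# Hypothetical preliminary estimate with three apparent clusters.
beta_tilde = np.array([0.05, 1.0, -0.02, 1.1, 2.5])
segments = ordered_segments(beta_tilde, delta=0.3)
```

With these values the indices are grouped as `[[2, 0], [1, 3], [4]]`: coordinates whose preliminary estimates differ by at most `delta` from their sorted neighbor share a segment, so their relative order is not used.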

Given an ordered segmentation $\Upsilon$, how can we design the penalties so that we can take
advantage of the ordering of segments $B_1, \dots, B_L$ and at the same time allow flexibility of
order shuffling within each segment? Towards this goal, we introduce the hybrid pairwise
penalty.

Definition 2.2. Given a penalty function $p_\lambda(\cdot)$ and tuning parameters $\lambda_1$ and $\lambda_2$, the
hybrid pairwise penalty corresponding to an ordered segmentation $\Upsilon$ is

$$P_{\Upsilon,\lambda_1,\lambda_2}(\beta) = \sum_{l=1}^{L-1}\sum_{i \in B_l,\, j \in B_{l+1}} p_{\lambda_1}(|\beta_i - \beta_j|) + \sum_{l=1}^{L}\sum_{i, j \in B_l} p_{\lambda_2}(|\beta_i - \beta_j|). \tag{8}$$

In (8), we call the first part the between-segment penalty and the second part the within-segment
penalty. The within-segment penalty penalizes all pairs of indices in each segment; hence,
it does not rely on any ordering within the segment. The between-segment penalty
penalizes pairs of indices from two adjacent segments, and it can be viewed as a "generalized"
fused penalty on segments.

When $L = p$, each $B_l$ is a singleton and (8) reduces to the fused penalty in (5). On
the other hand, when $L = 1$, there is only one segment $B_1 = \{1, \dots, p\}$, and (8) reduces
to the exhaustive pairwise penalty

$$P^{TV}_{\lambda}(\beta) = \sum_{1 \le i, j \le p} p_\lambda(|\beta_i - \beta_j|). \tag{9}$$

It is also called the total variation penalty, and the case with $p_\lambda(\cdot)$ being a truncated $L_1$
penalty is studied in Shen and Huang [2010]. Thus, the penalty (8) is a generalization of
both the fused penalty and the total variation penalty, which explains the name "hybrid".
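
To make the two limiting cases concrete, the following Python sketch evaluates the hybrid pairwise penalty (8) for a given list of segments. The names `capped_l1` and `hybrid_penalty` are hypothetical, and a capped $L_1$ penalty is used only as a simple stand-in for $p_\lambda$. One assumption should be noted: the within-segment sum is taken over all ordered pairs $(i, j)$ in a segment, which is one literal reading of (8), so each unordered pair is counted twice.

```python
import numpy as np

def capped_l1(t, lam, a=3.0):
    """Illustrative penalty with a flat tail: lam * min(|t|, a*lam)."""
    return lam * np.minimum(np.abs(t), a * lam)

def hybrid_penalty(beta, segments, lam1, lam2, pen=capped_l1):
    """Hybrid pairwise penalty (8) for ordered segments given as lists of indices."""
    total = 0.0
    # Between-segment part: all pairs (i, j) with i in B_l and j in B_{l+1}.
    for left, right in zip(segments[:-1], segments[1:]):
        diffs = beta[left][:, None] - beta[right][None, :]
        total += pen(diffs, lam1).sum()
    # Within-segment part: all ordered pairs (i, j) inside each segment (assumption).
    for seg in segments:
        diffs = beta[seg][:, None] - beta[seg][None, :]
        total += pen(diffs, lam2).sum()
    return float(total)

beta = np.array([0.0, 1.0, 3.0, 6.0])
# L = p: singletons, so only the fused (adjacent) differences are penalized.
fused_value = hybrid_penalty(beta, [[0], [1], [2], [3]], lam1=1.0, lam2=1.0)
# L = 1: a single segment, so every pair of coordinates is penalized.
tv_value = hybrid_penalty(beta, [[0, 1, 2, 3]], lam1=1.0, lam2=1.0)
```

With singletons the within-segment terms vanish because the penalty is zero at zero, so only the $p - 1$ adjacent differences remain; with a single segment there is no between-segment term at all.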

Now, we discuss how the condition (6) can be relaxed. Parallel to the definition that
$\tau$ preserves the order of $\beta^0$, we make the following definition.

Definition 2.3. An ordered segmentation $\Upsilon$ preserves the order of $\beta^0$ if
$\max_{j \in B_l} \beta^0_j \le \min_{j \in B_{l+1}} \beta^0_j$, for $l = 1, \dots, L-1$.

By the construction (7), even if $\tau$ does not preserve the order of $\beta^0$, it is still possible
that the resulting $\Upsilon$ does. Consider a toy example where $p = 4$, and $\beta^0_{\tau(1)} = \beta^0_{\tau(2)} =
\beta^0_{\tau(4)} < \beta^0_{\tau(3)}$, so that $\{\tau(1), \tau(2), \tau(4)\}$ and $\{\tau(3)\}$ are two true homogeneous groups in $\beta^0$.
By definition of $\tau$, $\tau$ ranks the index $\tau(3)$ wrongly ahead of $\tau(4)$ based on the preliminary estimate $\tilde\beta$. It
is obvious that $\tau$ does not preserve the order of $\beta^0$. However, as long as $\tilde\beta_{\tau(4)} \le \tilde\beta_{\tau(3)} + \delta$,
$\tau(3)$ and $\tau(4)$ are grouped into the same segment in (7), say, $B_1 = \{\tau(1), \tau(2)\}$ and
$B_2 = \{\tau(3), \tau(4)\}$. Then $\Upsilon$ still preserves the order of $\beta^0$ according to the above definition.
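
The toy example can be checked mechanically. The sketch below encodes Definition 2.3 in Python; the function name `preserves_order` and the numerical values are hypothetical, and indices 0 to 3 stand for $\tau(1), \dots, \tau(4)$.

```python
import numpy as np

def preserves_order(segments, beta0):
    """Definition 2.3: max of beta0 over B_l <= min of beta0 over B_{l+1} for all l."""
    return all(
        beta0[left].max() <= beta0[right].min()
        for left, right in zip(segments[:-1], segments[1:])
    )

# True coefficients listed in the preliminary rank order tau(1), ..., tau(4):
# the third-ranked coordinate has the larger true value, so tau is mis-ordered.
beta0 = np.array([1.0, 1.0, 2.0, 1.0])

complete_ordering = [[0], [1], [2], [3]]   # L = p, i.e. the ordering tau itself
segmentation = [[0, 1], [2, 3]]            # B_1 and B_2 from the toy example

tau_ok = preserves_order(complete_ordering, beta0)   # False
seg_ok = preserves_order(segmentation, beta0)        # True
```

The complete ordering fails because the singleton holding the larger value precedes a singleton holding the smaller one, whereas merging the last two coordinates into one segment removes the violated comparison.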

Now we formally introduce the advanced version of Clustering Algorithm in Regression
via Data-driven Segmentation (aCARDS). It consists of three steps, where the first
two steps are very similar to the way that we assign letter grades based on an exam
(preliminary estimate).

• Preliminary Ranking: Given a preliminary estimate $\tilde\beta$, generate the rank statistics
$\{\tau(j) : 1 \le j \le p\}$ such that $\tilde\beta_{\tau(1)} \le \tilde\beta_{\tau(2)} \le \cdots \le \tilde\beta_{\tau(p)}$.

• Segmentation: For a tuning parameter $\delta > 0$, construct an ordered segmentation
$\Upsilon$ as described in (7).

• Estimation: For tuning parameters $\lambda_1$ and $\lambda_2$, compute the solution $\hat\beta$ that minimizes

$$Q_n(\beta) = \frac{1}{2n}\|y - X\beta\|^2 + P_{\Upsilon,\lambda_1,\lambda_2}(\beta). \tag{10}$$

In Section 4, we shall show that if $\Upsilon$ preserves the order of $\beta^0$, under certain conditions,
$\hat\beta$ recovers the true homogeneous groups of $\beta^0$ with high probability. Therefore,
to guarantee the success of this advanced version of CARDS, we need the existence of a
$\delta > 0$ for the initial estimate such that the associated $\Upsilon$ preserves the order of $\beta^0$. We
see from the toy example that even when (6) fails, this condition can still hold. So the
advanced version of CARDS requires weaker conditions on $\tilde\beta$. The main reason is that
the hybrid penalty contains penalty terms corresponding to more pairs of indices. Hence,
it is more robust to possible mis-ordering in $\tau$. In fact, the basic version of CARDS is a
special case with $\delta = 0$.

2.3 CARDS under sparsity 

In applications, we may need to explore homogeneity and sparsity simultaneously. Often
the preliminary estimator $\tilde\beta$ takes into account the sparsity, namely it is obtained with
a penalized least-squares method [Fan and Li, 2001; Tibshirani et al., 2005] or sure
independence screening [Fan and Lv, 2008]. Suppose $\tilde\beta$ has the sure screening property,
i.e., $S_0 \subset \tilde S$ with high probability, where $\tilde S$ and $S_0$ denote the support of $\tilde\beta$ and $\beta^0$,
respectively. We modify CARDS as follows: In the first two steps, using the non-zero
elements of $\tilde\beta$, we can similarly construct data-driven hybrid penalties only on coefficients
of variables in $\tilde S$. In the third step, we fix $\hat\beta_{\tilde S^c} = 0$ and obtain $\hat\beta_{\tilde S}$ by minimizing the
following penalized least squares

$$Q^{sparse}_n(\beta_{\tilde S}) = \frac{1}{2n}\|y - X_{\tilde S}\beta_{\tilde S}\|^2 + P_{\Upsilon,\lambda_1,\lambda_2}(\beta_{\tilde S}) + \sum_{j \in \tilde S} p_\lambda(|\beta_j|), \tag{11}$$

where $X_{\tilde S}$ is the submatrix of $X$ restricted to columns in $\tilde S$. In (11), the second term
is the hybrid penalty to encourage homogeneity among coefficients of variables already
selected in $\tilde\beta$, and the third term is the element-wise penalty to help further filter out
falsely selected variables. We call this modified version the shrinkage-CARDS (sCARDS).



3 Analysis of the basic CARDS 

In this section, we analyze theoretical properties of the basic CARDS. For simplicity,
we assume that there is no group of 0, i.e., the usual sparsity is not explicitly explored.
We first provide heuristics for two essential questions: (1) How does taking advantage of
homogeneity improve the convergence rate of $\|\hat\beta - \beta^0\|$? (2) What is the order
of minimum signal strength required for recovering the true groups with high probability?
We then formally state our main results. After that, we will give conditions under which
the ordinary least squares can provide a good preliminary estimator, as well as the effect
of mis-ranking on CARDS.



3.1 Heuristics 

Consider an ideal case of orthogonal design $X^TX = nI_p$ (necessarily $p \le n$). The ordinary
least-squares estimator $\hat\beta^{ols} = (X^TX)^{-1}X^Ty$ has the decomposition

$$\hat\beta^{ols}_j = \beta^0_j + \epsilon_j, \qquad \epsilon_j \overset{iid}{\sim} N(0, n^{-1}), \qquad j = 1, \dots, p.$$

It is clear by the square-root law that $\|\hat\beta^{ols} - \beta^0\| = O_p(\sqrt{p/n})$. Now, if there are $K$
homogeneous groups in $\beta^0$ and we know the true groups, the original model (1) can be
rewritten as

$$y = X_A\beta^0_A + \varepsilon,$$

where $\beta^0_A = (\beta^0_{A,1}, \dots, \beta^0_{A,K})^T$ contains the distinct values in $\beta^0$, and $X_A = (x_{A,1}, \dots, x_{A,K})$
with $x_{A,k} = \sum_{j \in A_k} x_j$. The corresponding ordinary least-squares estimator $\hat\beta^{ols}_A =
(X_A^TX_A)^{-1}X_A^Ty$ has the decomposition

$$\hat\beta^{ols}_{A,k} = \beta^0_{A,k} + \bar\epsilon_k, \qquad \bar\epsilon_k \sim N\Big(0, \frac{1}{n|A_k|}\Big), \text{ and the } \bar\epsilon_k\text{'s are independent.} \tag{12}$$

Here $\bar\epsilon_k = \frac{1}{|A_k|}\sum_{j \in A_k}\epsilon_j$ is the noise averaged over group $k$. The oracle estimator $\hat\beta^{oracle}$
is defined such that $\hat\beta^{oracle}_j = \hat\beta^{ols}_{A,k}$ for all $j \in A_k$. Then, by the square-root law,

$$\|\hat\beta^{oracle} - \beta^0\|^2 = \sum_{k=1}^{K}|A_k|\,\bar\epsilon_k^2 = O_p(K/n),$$

which implies immediately that $\|\hat\beta^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$.

The surprises of the results are twofold. First, the rate $\sqrt{K/n}$ is for $\|\hat\beta^{oracle} - \beta^0\|$
instead of $\|\hat\beta^{ols}_A - \beta^0_A\|$. The former can be viewed as duplicate counts of the terms in the
latter, hence it can be much larger than the latter. However, since there are $K$ parameters
in $\beta^0_A$, common heuristics in regression analysis give $\|\hat\beta^{ols}_A - \beta^0_A\| = O_p(\sqrt{K/n})$, and so
the convergence rate of $\|\hat\beta^{oracle} - \beta^0\|$ should be much larger than $\sqrt{K/n}$. The above
results seem to be counter-intuitive. The point is that in (12) the noises are averaged,
and so the rate of $\|\hat\beta^{ols}_A - \beta^0_A\|$ is much smaller than $\sqrt{K/n}$. In fact, by taking advantage
of homogeneity, we not only estimate far fewer parameters, but also reduce the noise
level.

The second surprise is that the rate has nothing to do with the sizes of the true
homogeneous groups. No matter whether we have $K$ groups of equal size, or one dominating
group and $(K - 1)$ very small groups, the rate is always the same in the oracle situation.
This is also a consequence of noise averaging.
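
The two rates can be checked numerically. The following Python sketch simulates the orthogonal-design decomposition directly, drawing the coordinate-wise least-squares noise with variance $1/n$, and compares the average squared error of the least-squares estimator with that of the oracle estimator that averages within the true groups. The sample size, group sizes, coefficient values and the function name `average_squared_errors` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200                      # hypothetical sample size
sizes = [50, 30, 20]         # hypothetical, deliberately unequal group sizes
p, K = sum(sizes), len(sizes)
labels = np.repeat(np.arange(K), sizes)
beta0 = np.array([-1.0, 0.5, 2.0])[labels]

def average_squared_errors(n_rep=2000):
    """Monte Carlo averages of ||beta_ols - beta0||^2 and ||beta_oracle - beta0||^2."""
    # Under X^T X = n I_p, the OLS coordinates equal beta0 plus independent N(0, 1/n) noise.
    ols = beta0 + rng.normal(scale=n ** -0.5, size=(n_rep, p))
    oracle = np.empty_like(ols)
    for k in range(K):
        idx = labels == k
        # The oracle estimator replaces each coordinate by its group average.
        oracle[:, idx] = ols[:, idx].mean(axis=1, keepdims=True)
    err_ols = ((ols - beta0) ** 2).sum(axis=1).mean()
    err_oracle = ((oracle - beta0) ** 2).sum(axis=1).mean()
    return err_ols, err_oracle

err_ols, err_oracle = average_squared_errors()
```

In this setup `err_ols` should be close to $p/n$ and `err_oracle` close to $K/n$, and changing `sizes` while keeping $K$ fixed should leave the oracle error essentially unchanged, in line with the second observation above.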

Next, we discuss when the CARDS estimator equals the oracle estimator $\hat\beta^{oracle}$ that is
based on the knowledge of the true grouping structure. For simplicity, we still consider the
case of orthogonal design $X^TX = nI_p$, and assume the preliminary ordering $\tau$ preserves
the order of $\beta^0$ so that the basic version of CARDS works. Write $\tau(j) = j$ without loss
of generality. CARDS finds a local solution of

$$Q_n(\beta) = \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p-1} p_\lambda(|\beta_{j+1} - \beta_j|)
= \frac{1}{2n}\|y - Xz\|^2 + \frac{1}{2}\|z - \beta\|^2 + \sum_{j=1}^{p-1} p_\lambda(|\beta_{j+1} - \beta_j|),$$

where $z = n^{-1}X^Ty$ is the vector of marginal correlations (when $y$ is also normalized). As
a result, if the estimator produced by CARDS is the oracle estimator $\hat\beta^{oracle}$, necessarily
$\hat\beta^{oracle}$ has to satisfy the KKT condition

$$\begin{aligned}
-(z_1 - \hat\beta^{oracle}_1) - \bar p_\lambda(d\hat\beta^{oracle}_2) &= 0, \\
-(z_j - \hat\beta^{oracle}_j) + \bar p_\lambda(d\hat\beta^{oracle}_j) - \bar p_\lambda(d\hat\beta^{oracle}_{j+1}) &= 0, \qquad 2 \le j \le p-1, \\
-(z_p - \hat\beta^{oracle}_p) + \bar p_\lambda(d\hat\beta^{oracle}_p) &= 0,
\end{aligned} \tag{13}$$

where $d\hat\beta^{oracle}_j = \hat\beta^{oracle}_j - \hat\beta^{oracle}_{j-1}$ for $2 \le j \le p$; and $\bar p_\lambda(t) = p'_\lambda(|t|)\mathrm{sgn}(t)$ with $\mathrm{sgn}(t) = 1$
for $t > 0$, $-1$ for $t < 0$, and any value on $[-1, 1]$ for $t = 0$. Write the true groups as $A_k =
\{j_k, j_k + 1, \dots, j_{k+1} - 1\}$, $1 \le k \le K$, for some $1 = j_1 < j_2 < \cdots < j_K < j_{K+1} = p + 1$.



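It is not hard to show that the sufficient and necessary conditions for (13) to hold are

$$\begin{aligned}
&\bar p_\lambda(d\hat\beta^{oracle}_{j_k}) = 0, && 2 \le k \le K, \\
&\Big|\sum_{i=j_k}^{j}(\hat\beta^{oracle}_i - z_i)\Big| \le p'_\lambda(0+), && 1 \le k \le K,\ j_k \le j \le j_{k+1} - 1.
\end{aligned} \tag{14}$$

Here $d\hat\beta^{oracle}_{j_k}$ is the estimated coefficient gap between groups $A_{k-1}$ and $A_k$ in the oracle
estimator, and it is equal to $d\beta^0_{j_k} + \bar\epsilon_k - \bar\epsilon_{k-1}$, where $d\beta^0_{j_k}$ is the true coefficient gap between
groups $A_{k-1}$ and $A_k$. Also, $\hat\beta^{oracle}_j - z_j = \bar\epsilon_k - \epsilon_j$ for $j \in A_k$, which is purely determined by
the noises. Therefore, to guarantee (14), the penalty function $p_\lambda(\cdot)$ must have flat tails,
i.e., $p'_\lambda(t) = 0$ when $|t| > a\lambda$ ($a > 0$ is a constant); furthermore, the true coefficient gaps
$\{d\beta^0_{j_k} : k = 2, \dots, K\}$, the tuning parameter $\lambda$ and the noises $\{\epsilon_j\}$ need to satisfy, say,

$$\begin{cases}
\min_{2 \le k \le K}|d\beta^0_{j_k}| > 2(a+1)\lambda, \\
\max_{1 \le k \le K}|\bar\epsilon_k| < \lambda, \\
\max_{1 \le k \le K}\max_{j \in A_k}\Big|\sum_{i=j_k}^{j}(\bar\epsilon_k - \epsilon_i)\Big| \le p'_\lambda(0+).
\end{cases} \tag{15}$$

Note that $p'_\lambda(0+) = \lambda$ for most sparsity penalty functions; and $\bar\epsilon_k$ is much smaller than
$\max_{j \in A_k}|\epsilon_j|$ with high probability. So (15) requires that the minimum true coefficient
gap between groups satisfies

$$\min_{2 \le k \le K}|d\beta^0_{j_k}| > C\max_{1 \le k \le K}\max_{j \in A_k}\Big|\sum_{i=j_k}^{j}\epsilon_i\Big|. \tag{16}$$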

Using results in Darling and Erdős [1956], the right hand side of (16) is upper bounded
by $C\max_k\{\sqrt{|A_k|\log(K\log(|A_k|))/n}\}$ with high probability, for a sufficiently large constant
$C$. Therefore, for CARDS to produce the oracle estimator, the minimum coefficient gap
between true groups should be at least of that order. Up to a logarithmic factor, we write
this order as $\max_k\{\sqrt{|A_k|\log(p)/n}\}$.

3.2 Notations and regularity conditions 

Let $\mathcal{M}_A$ be the subspace of $\mathbb{R}^p$ defined by

$$\mathcal{M}_A = \{\beta \in \mathbb{R}^p : \beta_i = \beta_j \text{ for any } i, j \in A_k,\ 1 \le k \le K\}.$$

For each $\beta \in \mathcal{M}_A$, we can always write $X\beta = X_A\beta_A$, where $X_A$ is an $n \times K$ matrix
with $X_A(i, k) = \sum_{j \in A_k} X(i, j)$, with $X(i, j)$ denoting the $(i, j)$-element of $X$, and $\beta_A$ is
a $K \times 1$ vector with its $k$th component $\beta_{A,k}$ being the common coefficient in group $A_k$.
Define the matrix $D = \mathrm{diag}(|A_1|^{1/2}, \dots, |A_K|^{1/2})$. We introduce the following conditions
on the design matrix $X$:

Condition 3.1. $\|x_j\| = \sqrt{n}$, for $1 \le j \le p$. The eigenvalues of the matrix
$\frac{1}{n}D^{-1}X_A^TX_AD^{-1}$ are bounded below by $c_1 > 0$ and bounded above by $c_2 > 0$.

In the case of orthogonal design, i.e., $\frac{1}{n}X^TX = I_p$, the matrix $\frac{1}{n}D^{-1}X_A^TX_AD^{-1}$ simplifies
to $I_K$, and $c_1 = c_2 = 1$.

Let $\rho(t) = \lambda^{-1}p_\lambda(t)$ and $\bar\rho(t) = \rho'(|t|)\mathrm{sgn}(t)$. We assume that the penalty function
$p_\lambda(\cdot)$ satisfies the following condition.

Condition 3.2. $p_\lambda(\cdot)$ is a symmetric function and it is non-decreasing and concave on
$[0, \infty)$. $\rho'(t)$ exists and is continuous except for a finite number of $t$, with $\rho'(0+) = 1$.
There exists a constant $a > 0$ such that $\rho(t)$ is a constant for all $|t| \ge a\lambda$.

We also assume that the noise vector $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$ has sub-Gaussian tails.

Condition 3.3. For any vector $a \in \mathbb{R}^n$ and $x > 0$, $P(|a^T\varepsilon| > \|a\|x) \le 2e^{-c_3x^2}$, where $c_3$
is a positive constant.

Given the design matrix $X$, let $X_k$ be its submatrix formed by including columns in
$A_k$, for $1 \le k \le K$. For any vector $v \in \mathbb{R}^q$, let $DC(v) = \max_{1 \le i \le q}|v_i - q^{-1}\sum_{j=1}^{q}v_j|$ be
the "deviation from centrality". Define

$$\sigma_k = \lambda_{\max}\Big(\frac{1}{n}X_k^TX_k\Big) \quad \text{and} \quad \nu_k = \max_{\mu \in \mathcal{M}_A : \|\mu\| = 1} DC\Big(\frac{1}{n}X_k^TX\mu\Big), \tag{17}$$

where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue operator. In the case of orthogonal design,
$\sigma_k = 1$ and $\nu_k = 0$. Let $b_n = \frac{1}{2}\min_{1 \le k < l \le K}|\beta^0_{A,k} - \beta^0_{A,l}|$ denote half of the minimal gap between
two groups in $\beta^0$, and $\lambda = \lambda_n$ the tuning parameter in the penalty function.

3.3 Main results 

When the true groups $A_1, \dots, A_K$ are known, the oracle estimator is

$$\hat\beta^{oracle} = \arg\min_{\beta \in \mathcal{M}_A}\Big\{\frac{1}{2n}\|y - X\beta\|^2\Big\}.$$

Theorem 3.1. Suppose Conditions 3.1-3.3 hold, $K = o(n)$, and the preliminary estimate
$\tilde\beta$ generates an order $\tau$ that preserves the order of $\beta^0$ with probability at least $1 - \epsilon_0$. If
$b_n > a\lambda_n$ and

$$\lambda_n \gg \max_k\Big\{\sqrt{\sigma_k|A_k|\log(p)/n} + (1 + \nu_k|A_k|)\sqrt{K\log(n)/n}\Big\}, \tag{18}$$

then with probability at least $1 - \epsilon_0 - n^{-1}K - 2p^{-1}$, $\hat\beta^{oracle}$ is a strict local minimum of
(5). Moreover, $\|\hat\beta^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$.

Theorem 3.1 shows that there exists a local minimum of (5) which is equal to the oracle
estimator with overwhelming probability. This strong oracle property is a stronger result
than the oracle property in [Fan and Li, 2001].



The bCARDS formulation (5) is a non-convex problem and it may have multiple local
minima. In practice, we apply the Local Linear Approximation algorithm (LLA) [Zou
and Li, 2008] to solve it: start from an initial solution $\hat\beta^{(0)} = \hat\beta^{initial}$; at step $m$, update
the solution by

$$\hat\beta^{(m)} = \arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p-1} p'_\lambda\big(|\hat\beta^{(m-1)}_{\tau(j+1)} - \hat\beta^{(m-1)}_{\tau(j)}|\big) \cdot |\beta_{\tau(j+1)} - \beta_{\tau(j)}|\Big\}.$$

Given $\hat\beta^{initial}$, this algorithm produces a unique sequence of estimators which converges to
a certain local minimum. Theorem 3.2 shows that under certain conditions, the sequence
of estimators produced by the LLA algorithm converges to the oracle estimator.
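
Each LLA step is a weighted $L_1$ fused problem whose weights come from the derivative of the penalty at the previous iterate. The Python sketch below computes only these weights, using the SCAD derivative with the common choice $a = 3.7$ as an assumed example of $p'_\lambda$; the names `scad_derivative` and `lla_weights` and the numerical values are hypothetical, and solving the resulting weighted problem is left to any convex solver.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """SCAD derivative p'_lam(|t|): lam on [0, lam], then (a*lam - |t|)_+ / (a - 1)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def lla_weights(beta_prev, tau, lam, a=3.7):
    """Weights on |beta_{tau(j+1)} - beta_{tau(j)}| in one LLA step, j = 1, ..., p-1."""
    gaps = np.diff(beta_prev[tau])   # adjacent gaps of the previous iterate along tau
    return scad_derivative(gaps, lam, a)

# Hypothetical previous iterate with one large jump between two tight clusters.
beta_prev = np.array([0.0, 0.05, 2.0, 2.1])
tau = np.argsort(beta_prev, kind="stable")
weights = lla_weights(beta_prev, tau, lam=0.5)
```

Here the two small gaps receive the full weight $\lambda$ and are shrunk like an $L_1$ fused penalty, while the large gap receives weight zero because of the flat tail of the penalty, so the jump between clusters is left unpenalized in the next step.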



Theorem 3.2. Under the conditions of Theorem 3.1, suppose $\rho'(\lambda_n) \ge a_0$ for some constant
$a_0 > 0$, and there exists an initial solution $\hat\beta^{initial}$ of (5) satisfying $\|\hat\beta^{initial} - \beta^0\|_\infty \le
\lambda_n/2$. Then with probability at least $1 - \epsilon_0 - n^{-1}K - 2p^{-1}$, the LLA algorithm yields
$\hat\beta^{oracle}$ after one iteration, and it converges to $\hat\beta^{oracle}$ after two iterations.

The $L_1$ penalty $\rho(t) = |t|$ is widely used in high-dimensional penalization methods,
partially due to its convexity. For example, it can be used here to get the initial solution
$\hat\beta^{initial}$ for the LLA algorithm. However, this penalty function is excluded in Condition
3.2, and consequently Theorem 3.1 does not apply. Now, we discuss the $L_1$ penalty in
more detail.

We first relax the requirement that $\tau$ preserves the order of $\beta^0$. Instead, we consider
the case that $\tau$ is "consistent" with coefficient groups in $\beta^0$, that is, for any two variables
in the same true group, variables ranked between them are also in this group (if $\tau$ preserves
the order of $\beta^0$, $\tau$ belongs to this class). Note that we do not require $\beta^0_{\tau(i)} \le \beta^0_{\tau(j)}$ for all
$i < j$. In this case, recovering the true groups is equivalent to locating jumps (which can
have positive or negative magnitudes) in $\beta^0$.

Below we introduce an "irrepresentability" condition. For $k = 1, \dots, K-1$, write
$d\beta^0_{A,k} = \beta^0_{A,k+1} - \beta^0_{A,k}$. Define the $K$-dimensional vector $d^0$ by $d^0_1 = \mathrm{sgn}(d\beta^0_{A,1})$, $d^0_K =
-\mathrm{sgn}(d\beta^0_{A,K-1})$ and

$$d^0_k = \mathrm{sgn}(d\beta^0_{A,k}) - \mathrm{sgn}(d\beta^0_{A,k-1}), \qquad 2 \le k \le K-1.$$

Here $d^0$ is the adjacent difference of the sign vector of jumps in $\beta^0$. For example, suppose
$K = 4$ and the common coefficients in the 4 groups satisfy $\beta^0_{A,2} - \beta^0_{A,1} > 0$, $\beta^0_{A,3} - \beta^0_{A,2} < 0$
and $\beta^0_{A,4} - \beta^0_{A,3} > 0$. Then $d^0 = (1, -2, 2, -1)^T$. Also, define the $p$-dimensional vector

$$b^0 = X^TX_A(X_A^TX_A)^{-1}d^0.$$
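
The vector $d^0$ can be computed by padding the sign vector of jumps with zeros on both ends and taking adjacent differences. The following Python sketch reproduces the $K = 4$ example; the function name `sign_difference_vector` and the numerical group values are hypothetical and chosen only to match the sign pattern of the example.

```python
import numpy as np

def sign_difference_vector(beta_A):
    """d^0: adjacent differences of the zero-padded sign vector of jumps in beta_A."""
    jumps = np.sign(np.diff(beta_A))                 # sgn(d beta_{A,k}), k = 1, ..., K-1
    padded = np.concatenate(([0.0], jumps, [0.0]))
    return np.diff(padded)                           # length-K vector d^0

# Group values with jumps of sign (+, -, +), as in the K = 4 example.
beta_A = np.array([0.0, 1.0, 0.5, 2.0])
d0 = sign_difference_vector(beta_A)
```

For monotonically increasing group values, all jumps are positive and the interior entries of $d^0$ vanish, leaving only the two boundary entries.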
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In the case of orthogonal design X T X = nl p , b° G M.a and it has the form b® = l/\A k \ 
for j G A}.. For each j G A k , let 

Alj = Mi) eA k :i< j}, A 2 kj = {r{i) G A k : i > j}. 

Namely, A k contain indices in group k that have ranks < j in the mapping r, and A^ 
contain those have ranks > j. Write 6 k j = \A l k ^\/\A k \ as the proportion of indices in group 
k which is mapped in front of (and including) r(j). Denote b k j = j^rjj Z^^g^y ^r(i) 
the average of elements in b° over the indices in A\? , and 6 t ,- = , \ ,, V ... .1 ,■ b° r , the 
average of elements in b° over the indices in A^ . The following inequality is called the 
"irrepresentability" condition on X and (3°: for any 1 < k < K and j G A^, j 7^ jk+i — 1> 



1 - co n > 

\e ljS gn(df3 A1 ) + |Ai| 2 y (l - e^ihj - b^l 

|(1 - ^ i )sgn(d^ ifc _ 1 ) + e fci sgn(d^ )Jfe ) + |^ fc | 2 ^(l - - |> 2 < k < K 

|(1 - iCi )sgn(^O jK _ 1 ) + \A K \ 2 e Kj (l - 9 Kj )(b Kj - b Kj )\. 
Here {co n } is a positive sequence, which can go to 0. In the case of orthogonal design, 
b° G M-a and b k j — b k j = holds for all k and j G A k . The "irrepresentability" condition 
reduces to 

f |^sgn(d/30 1 )|, 
1 - "n > I |(1 - ^•)sgn(d/3° ifc „ 1 ) + e fci sgn(d/^ ifc )|, 2 < k < K - 1, 

[ \{i-e K j)*&{d^ K _ x )\. 

This is possible only when 

sgn^^) sgn(df3% k ), 2 < k < K - 1. (20) 
Noting that l/|^4fc| < < 1 — the associated ui n can be chosen as minfc{l/|^4fc|} 



when (20) holds. 
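The sign-difference vector $d^0$, the vector $b^0$ and the reduced condition can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes that $X_A$ is the $n \times K$ matrix whose $k$-th column is the sum of the columns of $X$ in group $A_k$, and that groups are listed in the order induced by $\tau$.

```python
import numpy as np

def sign_difference_vector(group_values):
    """d0 built from the K common coefficients, listed in the order tau."""
    s = np.sign(np.diff(np.asarray(group_values, dtype=float)))  # signs of the K-1 jumps
    d0 = np.empty(len(group_values))
    d0[0] = s[0]
    d0[-1] = -s[-1]
    d0[1:-1] = s[1:] - s[:-1]
    return d0

def b0_vector(X, groups, d0):
    """b0 = X^T X_A (X_A^T X_A)^{-1} d0, with X_A assumed to hold group-wise column sums."""
    X_A = np.column_stack([X[:, g].sum(axis=1) for g in groups])
    return X.T @ X_A @ np.linalg.solve(X_A.T @ X_A, d0)

def reduced_rhs(group_values, group_sizes):
    """Largest right-hand side of the reduced condition over all groups and admissible ranks."""
    s = np.sign(np.diff(np.asarray(group_values, dtype=float)))
    K = len(group_sizes)
    worst = 0.0
    for k in range(K):
        m = group_sizes[k]
        left = s[k - 1] if k > 0 else 0.0       # jump entering group k (absent for the first group)
        right = s[k] if k < K - 1 else 0.0      # jump leaving group k (absent for the last group)
        for a in range(1, m):                   # theta = a/m; the last rank of the group is excluded
            theta = a / m
            worst = max(worst, abs((1 - theta) * left + theta * right))
    return worst

rng = np.random.default_rng(0)
n, sizes = 60, [5, 5, 5, 5]
p = sum(sizes)
groups = [list(range(5 * k, 5 * (k + 1))) for k in range(4)]
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q                               # orthogonal design: X^T X = n I_p

d0 = sign_difference_vector([0.0, 1.0, 0.5, 2.0])  # jumps with signs +, -, +
b0 = b0_vector(X, groups, d0)
alternating = reduced_rhs([0.0, 1.0, 0.5, 2.0], sizes)
monotone = reduced_rhs([0.0, 1.0, 2.0, 3.0], sizes)
```

With alternating jump signs the right-hand side stays at or below $1 - \min_k 1/|A_k|$, whereas with monotone group values it reaches 1, so no positive $\omega_n$ can satisfy the reduced condition.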

Theorem 3.3. Suppose Conditions 3.1 and 3.3 hold, the "irrepresentability" condition (19) is satisfied, $K = o(n)$, and the preliminary estimate $\tilde\beta$ generates an order $\tau$ that is consistent with $\beta^0$ with probability at least $1 - \epsilon_0$. If $b_n$ and $\lambda_n$ satisfy

$$b_n > \sqrt{K\log(n)/n} + \lambda_n\Big(\sum_{k=1}^K \frac{1}{|A_k|}\Big)^{1/2}, \qquad \lambda_n \ge \omega_n^{-1}\max_k\Big\{\sqrt{\sigma_k|A_k|\log(p)/n}\Big\}, \qquad (21)$$

then with probability at least $1 - \epsilon_0 - n^{-1}K - 2p^{-1}$, (5) has a unique global minimum $\hat\beta$ such that $\hat\beta \in \mathcal{M}_A$ and it satisfies the sign restrictions $\mathrm{sgn}(\hat\beta_{A,k+1} - \hat\beta_{A,k}) = \mathrm{sgn}(\beta^0_{A,k+1} - \beta^0_{A,k})$, $k = 1, \cdots, K-1$. Moreover, $\|\hat\beta - \beta^0\| = O_p(\sqrt{K/n} + \gamma_n)$, where $\gamma_n = \lambda_n\big(\sum_{k=1}^K \frac{1}{|A_k|}\big)^{1/2}$.

Compared to Theorem 3.1, there is an extra bias term in the $L_2$ estimation error. We consider an ideal case where the sizes of all groups have the same order $s/K$, the sequence $\omega_n \ge \omega$ for some positive constant $\omega$, and $\max_k \sigma_k \le C$. From (21), the magnitude of the bias term is $\sqrt{K\log(p)/n}$, which is much larger than $\sqrt{K/n}$. So in the $L_1$ penalty case, it is generally hard to guarantee both exact recovery of the true grouping structure and the $\sqrt{K/n}$-convergence rate of $\|\hat\beta - \beta^0\|$. Moreover, the "irrepresentability" condition is very restrictive, even in the orthogonal design case. From (20), in order to exactly locate all jumps, necessarily all consecutive jumps (in the ordering $\tau$) have opposite signs. However, this is sometimes hard to guarantee. In particular, when $\tau$ preserves the order of $\beta^0$, all the jumps have positive signs.

3.4 Preliminary estimator, effects of mis-ranking 

We now give sufficient conditions under which the least-squares estimator induces an order-preserving rank. When sparsity is explored, once model selection consistency is achieved [Fan and Lv, 2011, Fan et al., 2012], the problem becomes a dense problem. Hence, the fundamental insights can be gained when the coefficients are not sparse, and this is the case that we focus upon next.

The ordinary least squares estimator

$$\hat\beta^{ols} = \arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|^2\Big\}$$

can be used as the preliminary estimator. The following theorem shows that it induces a rank preserving mapping that satisfies Theorem 3.1.



Theorem 3.4. Under Condition 3.3, suppose $p < n$ and $\|(X^TX)^{-1}\|_{\max} \le c_4 n^{-1}$ for some constant $c_4 > 0$. If $b_n > \sqrt{(2c_4/c_3)\log(p)/n}$, then with probability at least $1 - 2p^{-1}$, the order generated from $\hat\beta^{ols}$ preserves the order of $\beta^0$.
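As an illustration of the order-preservation statement, the following is a small simulation sketch under assumed settings (Gaussian design, unit-variance noise, four well-separated groups). The helper `preserves_order` is a hypothetical name introduced here; it checks that ranking the estimate in increasing order never places a larger true coefficient before a smaller one.

```python
import numpy as np

def preserves_order(beta_hat, beta_true):
    """True if sorting beta_hat increasingly leaves beta_true non-decreasing."""
    order = np.argsort(beta_hat, kind="stable")   # the ranking induced by the estimate
    return bool(np.all(np.diff(np.asarray(beta_true)[order]) >= 0))

rng = np.random.default_rng(0)
n, p = 400, 20
beta_true = np.repeat([-2.0, -1.0, 1.0, 2.0], p // 4)   # minimum gap 1 between groups
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

# Ordinary least squares as the preliminary estimator.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
ols_ok = preserves_order(beta_ols, beta_true)
```

Ties inside a true group are allowed in any order; only swaps across groups count as a violation.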

When the order $\tau$ extracted from $\tilde\beta$ does not preserve the order of $\beta^0$, the penalty in (5) is no longer a "correct" penalty for promoting the true grouping structure. There is no hope that local minima of (5) exactly recover the true groups. However, if there are not too many misorderings in $\tau$, it is still possible to control $\|\hat\beta - \beta^0\|$.

Given an order $\tau$, define $K^*(\tau) = \sum_{j=1}^{p-1} 1\{\beta^0_{\tau(j)} \ne \beta^0_{\tau(j+1)}\}$, which is the number of jumps in $\beta^0$ in the ordering $\tau$. These jumps define subgroups $A'_1, A'_2, \cdots, A'_{K^*}$, each being a subset of one true group. Although different subgroups may share the same true coefficients, consecutive subgroups, $A'_k$ and $A'_{k+1}$, have a gap in coefficient values. As a result, the above results apply to this subgrouping structure. The following theorem is a direct application of the proof of Theorem 3.1 and its details are omitted.
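The jump count and the induced subgroups are easy to compute for a given ordering. The sketch below uses hypothetical helper names and 0-based indices; it counts positions where consecutive true coefficients differ along the ordering and cuts the ordering at those positions.

```python
import numpy as np

def count_jumps(beta_true, order):
    """Number of positions j with beta_true[order[j]] != beta_true[order[j+1]]."""
    b = np.asarray(beta_true)[np.asarray(order)]
    return int(np.sum(b[1:] != b[:-1]))

def subgroups(beta_true, order):
    """Maximal runs of equal true coefficients along the ordering."""
    order = list(order)
    groups, current = [], [order[0]]
    for prev, nxt in zip(order[:-1], order[1:]):
        if beta_true[nxt] != beta_true[prev]:
            groups.append(current)   # a jump closes the current run
            current = []
        current.append(nxt)
    groups.append(current)
    return groups

beta_true = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
good_order = [0, 1, 2, 3, 4, 5]      # preserves the order of beta_true
bad_order = [0, 2, 1, 3, 4, 5]       # one swap across two true groups
```

In this toy case a single cross-group swap raises the number of jumps from 2 to 4, and each resulting run is still a subset of one true group.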



Theorem 3.5. Suppose Conditions 3.1-3.3 hold, $K^*(\tau) = o(n)$, $b_n > a\lambda_n$ and $\lambda_n$ satisfies (18). Then with probability tending to 1, there is a strictly local minimum $\hat\beta$ of (5) such that $\|\hat\beta - \beta^0\| = O_p(\sqrt{K^*(\tau)/n})$.

4 Analysis of the advanced CARDS 

In this section, we analyze the advanced version of CARDS, as well as its variant, the shrinkage-CARDS.

4.1 Main results

To guarantee the success of the advanced CARDS, a key condition is that the ordered segmentation preserves the order of $\beta^0$. This implies restrictions on how much the ordering (in terms of increasing values) of coordinates in $\tilde\beta$ deviates from that of $\beta^0$. This is reflected in how the segments $\{B_1, \cdots, B_L\}$ intersect with the true groups $\{A_1, \cdots, A_K\}$. Write $V_{kl} = A_k \cap B_l$. We have the following proposition:

Proposition 4.1. When $\Upsilon$ preserves the order of $\beta^0$, for each $k$, there exist $d_k$ and $u_k$ such that $A_k = \cup_{d_k \le l \le u_k} V_{kl}$, and $V_{kl} = B_l$ for $d_k < l < u_k$. For each $l$, there exist $a_l$ and $b_l$ such that $B_l = \cup_{a_l \le k \le b_l} V_{kl}$, and $V_{kl} = A_k$ for $a_l < k < b_l$.

Proposition 4.1 indicates that there are two cases for each $A_k$: either $A_k$ is contained in a single $B_l$, or it is contained in some consecutive $B_l$'s where, except for the first and last one, all the other $B_l$'s are fully occupied by $A_k$. Similarly, there are two cases for each $B_l$: either it is contained in a single $A_k$, or it is contained in some consecutive $A_k$'s where, except for the first and last one, all the other $A_k$'s are fully occupied by $B_l$.

Theorem 4.1. Suppose Conditions 3.1-3.3 hold, $K = o(n)$, and the preliminary estimate $\tilde\beta$ and the tuning parameter $\delta_n$ together generate an ordered segmentation $\Upsilon$ that preserves the order of $\beta^0$ with probability at least $1 - \epsilon_0$. If $b_n > a\max\{\lambda_{1n}, \lambda_{2n}\}$,

Ai„ > maxll^- 2 [v^fcl^fcl log(p)/" + (1 + v k \A k \)y/Klog(n)/n] } , (22) 

k,h K L J J 
and

X 2n > max { y/\og(p)/(n\A k \) + v k log(n)/(n|A fe |)} , (23) 

^oracle , 

then with probability at least 1 — eo — 0(n ), (3 is a strictly local minimum of (10). 

Moreover, $\|\hat\beta^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$.



Compared to Theorem 3.1, the advanced version of CARDS not only imposes less restrictive conditions on $\tilde\beta$, but also requires a smaller minimum gap between true coefficients.

Next, we establish the asymptotic normality of the CARDS estimator. By Theorem 4.1, with probability tending to 1, the advanced CARDS performs as if it were the oracle. In the oracle situation, for example, if $p = 5$ and $\beta_1 = \beta_4$, $\beta_3 = \beta_5$, the accuracy of estimating $\beta$ is the same as if we know the model:

$$Y = \beta_1(X_1 + X_4) + \beta_2 X_2 + \beta_3(X_3 + X_5) + \epsilon.$$
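The oracle fit in this five-predictor example can be sketched as an ordinary least squares fit on the collapsed design. The snippet below is illustrative only: the coefficient values, sample size and Gaussian data are assumptions, and indices are 0-based, so the groups `[0, 3]`, `[1]`, `[2, 4]` encode $\beta_1 = \beta_4$ and $\beta_3 = \beta_5$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
groups = [[0, 3], [1], [2, 4]]           # assumed known grouping (the oracle information)
mu_true = np.array([1.0, -0.5, 2.0])     # hypothetical common values, one per group

beta_true = np.empty(p)
for k, g in enumerate(groups):
    beta_true[g] = mu_true[k]

X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

# Collapse the design: one column per group, equal to the sum of its predictors.
X_A = np.column_stack([X[:, g].sum(axis=1) for g in groups])
mu_hat, *_ = np.linalg.lstsq(X_A, y, rcond=None)

# Expand the K fitted values back to a p-dimensional coefficient vector.
beta_oracle = np.empty(p)
for k, g in enumerate(groups):
    beta_oracle[g] = mu_hat[k]
```

Only three parameters are estimated instead of five, which is the source of the efficiency gain discussed in the text.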



Theorem 4.2. Let $\hat\beta$ be any local minimum of (10) such that $\|\hat\beta - \beta^0\| \le C\sqrt{K\log(n)/n}$ for a large constant $C > 0$ with probability at least $1 - o(1)$. Under conditions of Theorem 4.1, if $\|X_A(X_A^TX_A)^{-1/2}\|_{\infty} = O(1)$, then for a fixed positive integer $q$, and any sequence $\{B_n\}$ such that $B_n \in \mathbb{R}^{q \times K}$, $\|B_n^T\|_{2,\infty} = o(1)$ and $B_nB_n^T \to H$, where $H$ is a fixed $q \times q$ positive definite matrix, we have

$$B_n(X_A^TX_A)^{1/2}(\hat\beta_A - \beta^0_A) \xrightarrow{d} N(0, H),$$

where $\hat\beta_A$ is the $K$-dimensional vector of distinct values in $\hat\beta$.

In the case of orthogonal design $X^TX = nI$, the matrix $X_A(X_A^TX_A)^{-1/2}$ has orthonormal columns, so it is reasonable to assume $\|X_A(X_A^TX_A)^{-1/2}\|_{\infty} = O(1)$. In addition, when all the entries of $B_n$ have the same order, $\|B_n^T\|_{2,\infty} = O(1/\sqrt{K}) = o(1)$, as long as $K \to \infty$.

To compare the asymptotic variance of $\hat\beta$ and $\hat\beta^{ols}$, we introduce the following corollary.

Corollary 4.1. Suppose conditions of Theorem 4.2 hold and let $\hat\beta^{ols}$ and $\hat\beta$ be the ordinary least squares estimator and CARDS estimator respectively. Let $M_n$ be the $p \times K$ matrix with $M_n(j,k) = (1/|A_k|^{1/2})1\{j \in A_k\}$. For any sequence of $p$-dimensional vectors $a_n$,

$$v_{1n}^{-1/2}a_n^T(\hat\beta^{ols} - \beta^0) \xrightarrow{d} N(0, 1) \quad \text{and} \quad v_{2n}^{-1/2}a_n^T(\hat\beta - \beta^0) \xrightarrow{d} N(0, 1),$$

where $v_{1n} = a_n^T(X^TX)^{-1}a_n$ and $v_{2n} = a_n^TM_n(M_n^TX^TXM_n)^{-1}M_n^Ta_n$. In addition, $v_{1n} \ge v_{2n}$.
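The ordering of the two variances can be checked numerically. The following sketch, with a hypothetical function name and randomly generated inputs, evaluates $a^T(X^TX)^{-1}a$ and $a^TM(M^TX^TXM)^{-1}M^Ta$ for the normalized group-indicator matrix $M$ and compares them over several directions $a$.

```python
import numpy as np

def asymptotic_variances(X, groups, a):
    """Return (v1, v2) for the OLS and the grouped estimator along direction a."""
    p = X.shape[1]
    M = np.zeros((p, len(groups)))
    for k, g in enumerate(groups):
        M[g, k] = 1.0 / np.sqrt(len(g))          # M(j, k) = |A_k|^{-1/2} 1{j in A_k}
    G = X.T @ X
    v1 = a @ np.linalg.solve(G, a)
    Ma = M.T @ a
    v2 = Ma @ np.linalg.solve(M.T @ G @ M, Ma)
    return float(v1), float(v2)

rng = np.random.default_rng(2)
n, p = 100, 12
groups = [list(range(4 * k, 4 * (k + 1))) for k in range(3)]
X = rng.standard_normal((n, p))

pairs = [asymptotic_variances(X, groups, rng.standard_normal(p)) for _ in range(20)]
```

The inequality holds for every direction because $X^TX$ pre- and post-multiplied by its inverse square root turns $M(M^TX^TXM)^{-1}M^T$ into an orthogonal projection, which is dominated by the identity.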
4.2 CARDS under sparsity



In Section 2.3 we introduced the shrinkage-CARDS (sCARDS) to explore both homogeneity and sparsity. In sCARDS, given a preliminary estimator $\tilde\beta$ and a parameter $\delta$, we extract segments $B_1, \cdots, B_L$ such that $\cup_{l=1}^L B_l = \tilde S$, where $\tilde S$ is the support of $\tilde\beta$. Denote $B_0 = \{j : \tilde\beta_j = 0\}$. In this case, we say $\Upsilon = \{B_0, B_1, \cdots, B_L\}$ preserves the order of $\beta^0$ if $\max_{j \in B_0}|\beta^0_j| = 0$, and $\max_{j \in B_l}\beta^0_j \le \min_{j \in B_{l+1}}\beta^0_j$, for $l = 1, \cdots, L-1$. This implies that $\tilde\beta$ has the sure screening property; and on those preliminarily selected variables, the data-driven segments preserve the order of true coefficients. In particular, from Proposition 4.1, those falsely selected variables, i.e., $\{j : \beta^0_j = 0, \tilde\beta_j \ne 0\}$, should be contained in either a single segment or some consecutive segments.

Suppose there is a group of zero coefficients in $\beta^0$, namely, $A = (A_0, A_1, \cdots, A_K)$. Let $\mathcal{M}^*_A$ be the subspace of $\mathbb{R}^p$ defined by

$$\mathcal{M}^*_A = \{\beta \in \mathbb{R}^p : \beta_i = 0 \text{ for any } i \in A_0;\ \beta_i = \beta_j \text{ for any } i, j \in A_k,\ 1 \le k \le K\}.$$

Denote the support of $\beta^0$ as $S$ and $s = |S|$. The following theorem is proved in Section 8.

Theorem 4.3. Suppose Conditions 3.1-3.3 hold, $s = o(n)$, $\log(p) = o(n)$, and the preliminary estimate $\tilde\beta$ and the tuning parameter $\delta_n$ together generate an ordered segmentation $\Upsilon$ that preserves the order of $\beta^0$ with probability at least $1 - \epsilon_0$. If $b_n > a\max\{\lambda_{1n}, \lambda_{2n}\}$, $\min\{|\beta^0_j| : \beta^0_j \ne 0\} > 2a\lambda_n$, $\lambda_{1n}$ and $\lambda_{2n}$ satisfy (22)-(23) and $\lambda_n \gg \sqrt{\log(p)/n}$, then with probability at least $1 - \epsilon_0 - n^{-1}K - 2p^{-1}$, $\hat\beta^{oracle}$ is a strictly local minimum of (11). Moreover, $\|\hat\beta^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$.



The preliminary estimator $\tilde\beta$ can be chosen, for example, as the SCAD estimator

$$\hat\beta^{scad} \in \arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^p p_{\lambda'_n}(|\beta_j|)\Big\}, \qquad (24)$$

where $p_{\lambda'}(\cdot)$ is the SCAD penalty function of Fan and Li [2001]. The following theorem is a direct result of Theorem 2 in Fan and Lv [2011], and the proof is omitted.
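For reference, the SCAD penalty has a simple closed form: it is linear up to $\lambda$, quadratic between $\lambda$ and $a\lambda$, and constant beyond $a\lambda$. The sketch below writes out that standard form and its derivative; the function names are chosen here for illustration, and $a = 3.7$ is the usual default.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty evaluated at |t| (vectorized)."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t                                              # |t| <= lam
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))      # lam < |t| <= a*lam
    const = lam**2 * (a + 1) / 2                                  # |t| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))
```

The derivative vanishes for $|t| > a\lambda$, which is the property that leaves large coefficient gaps unpenalized at the margin.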



Theorem 4.4. Under Conditions 3.1 and 3.3, if $s = o(n)$, $\lambda'_n \gg n^{-1/2}[\log(n)]^2$ and $\min\{|\beta^0_j| : \beta^0_j \ne 0\} \gg n^{-1/2}\max\{\sqrt{\log p}, \|\frac{1}{n}X_{S^c}^TX_S\|_{\infty}\sqrt{\log n}\}$, then with probability at least $1 - o(1)$, there exists a strictly local minimum $\hat\beta^{scad}$ and $\delta_n = O(\log(n)/n)$ which together generate a segmentation preserving the order of $\beta^0$.
5 Simulation studies

We conduct numerical experiments to implement two versions of CARDS and their variant sCARDS. The goal is to investigate the performance of CARDS under different situations: Experiments 1 and 2 are based on the linear regression setting $Y_i = X_i^T\beta^0 + \epsilon_i$, where in Experiment 1 only the homogeneity is explored, and in Experiment 2 the homogeneity and sparsity are explored simultaneously. Experiment 3 is based on the spatial-temporal model $Y_{it} = X_{it}^T\beta_i + \epsilon_{it}$.

In all experiments, $\{X_i : 1 \le i \le n\}$ or $\{X_t : 1 \le t \le T\}$ are generated independently and identically from the multivariate standard Gaussian distribution, and $\{\epsilon_i : 1 \le i \le n\}$ or $\{\epsilon_{it} : 1 \le i \le p, 1 \le t \le T\}$ are IID samples of $N(0, 1)$. All results are based on 100 repetitions.

Experiment 1: Consider the linear regression setting with $p = 60$ and $n = 100$. Predictors are divided into four groups with each group having a size of 15. The four different values of the true regression coefficients are $-2r$, $-r$, $r$ and $2r$, respectively. Here different values of $r > 0$ lead to various signal-to-noise ratios.

We compare the performance of six different methods: Oracle, ordinary least squares (OLS), bCARDS, aCARDS, total variation (TV), and fused Lasso (fLasso). Oracle is the least squares estimator knowing the true groups. aCARDS and bCARDS are described in Section 2; here we let the penalty function $p_\lambda(\cdot)$ be the SCAD penalty with $a = 3.7$, and take the OLS estimator as the preliminary estimator. TV uses the exhaustive pairwise penalty (9) with $p_\lambda(\cdot)$ being the same as that in aCARDS and bCARDS. The fused Lasso is based on an order generated from ranking the OLS coefficients. Tuning parameters of all these methods are selected via the Bayesian information criterion (BIC).

Table 1: Medians of the average prediction error over 100 repetitions for Experiment 1.

           Oracle   OLS      bCARDS   aCARDS   TV       fLasso
r = 1      1.0355   1.6112   1.0504   1.1182   1.4847   1.4253
r = 0.9    1.0273   1.5885   1.0479   1.1048   1.4608   1.4186
r = 0.8    1.0359   1.5947   1.0826   1.1786   1.4777   1.4427
r = 0.7    1.0311   1.6038   1.1250   1.2830   1.5591   1.4625
r = 0.6    1.0370   1.6054   1.3172   1.4586   1.5795   1.4824
r = 0.5    1.0347   1.5826   1.3645   1.5734   1.5734   1.4668

Table 2: Medians of NMI over 100 repetitions for Experiment 1.

           Oracle   OLS      bCARDS   aCARDS   TV       fLasso
r = 1      1.0000   0.5059   0.9414   0.9784   0.7203   0.6503
r = 0.9    1.0000   0.5059   0.9414   0.9784   0.7167   0.6521
r = 0.8    1.0000   0.5059   0.8609   0.9355   0.7245   0.6549
r = 0.7    1.0000   0.5059   0.7912   0.8989   0.6991   0.6458
r = 0.6    1.0000   0.5059   0.7008   0.8763   0.6808   0.6373
r = 0.5    1.0000   0.5059   0.6722   0.6741   0.6654   0.6251

Performance is evaluated in terms of the average prediction error over an independent test set of size 10,000. In addition, to measure how close the estimated grouping structure is to the true one, we introduce the normalized mutual information (NMI), which is a common measure for similarity between clusterings, Fred and Jain [2003]. Suppose $C = \{C_1, C_2, \cdots\}$ and $D = \{D_1, D_2, \cdots\}$ are two sets of disjoint clusters of $\{1, \cdots, p\}$, define

$$\mathrm{NMI}(C, D) = \frac{I(C; D)}{[H(C) + H(D)]/2},$$

where $I(C; D) = \sum_{k,j}(|C_k \cap D_j|/p)\log\big(p|C_k \cap D_j|/(|C_k||D_j|)\big)$ is the mutual information between $C$ and $D$, and $H(C) = -\sum_k(|C_k|/p)\log(|C_k|/p)$ is the entropy of $C$. NMI$(C, D)$ takes values on [0, 1], and large NMI implies that the two grouping structures are close.
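The NMI can be computed directly from the cluster memberships. The following is a minimal sketch with hypothetical function names; it assumes the normalization by the average of the two entropies, and clusters are given as lists of indices.

```python
import numpy as np

def entropy(clusters, p):
    """H(C) = -sum_k (|C_k|/p) log(|C_k|/p)."""
    w = np.array([len(c) for c in clusters], dtype=float) / p
    return float(-np.sum(w * np.log(w)))

def mutual_information(C, D, p):
    """I(C; D), summing over pairs of clusters with non-empty intersection."""
    total = 0.0
    for c in C:
        c_set = set(c)
        for d in D:
            m = len(c_set & set(d))
            if m > 0:
                total += (m / p) * np.log(p * m / (len(c) * len(d)))
    return total

def nmi(C, D, p):
    """Mutual information normalized by the average of the two entropies."""
    return mutual_information(C, D, p) / ((entropy(C, p) + entropy(D, p)) / 2)

# Four true groups of 15 among p = 60, compared with 60 singletons
# (the grouping produced by an estimator with all coefficients distinct).
true_groups = [list(range(15 * k, 15 * (k + 1))) for k in range(4)]
singletons = [[j] for j in range(60)]
nmi_singletons = nmi(singletons, true_groups, 60)
```

With this normalization, 60 singletons against four groups of 15 give about 0.5059, and 100 singletons against four groups of 25 give about 0.4628, which agree with the OLS entries reported in the NMI tables.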

Table 1 shows medians of the average prediction error for six different methods under various values of $r$. Table 2 shows medians of NMI. The boxplots are displayed in Figure 1. We see that except for the case of weak signals ($r = 0.5$), two versions of CARDS outperform other methods in terms of smaller prediction error and larger NMI. bCARDS is especially good in achieving low prediction errors, even in the case $r = 0.5$. aCARDS has a better performance in NMI, which shows that it is good in recovering the true grouping structure.

Experiment 2: Consider the linear regression setting with $p = 100$ and $n = 150$. Among the 100 predictors, 60 are important ones and their coefficients are the same as those in Experiment 1. Besides, there are 40 unimportant predictors whose coefficients are all equal to 0.

We implemented sCARDS in this setting and compared its performance to different oracle estimators, Oracle, OracleO and OracleG, as well as ordinary least squares (OLS) and the SCAD estimator. The three oracles are defined with different prior information: the Oracle knows both the important predictors and the true groups among them; the OracleO only knows which are important predictors; and the OracleG only knows the true groups (it treats all unimportant predictors as one group with unknown coefficients). sCARDS is as described in Section 2; when implementing it, we take the SCAD estimator as the preliminary estimator.

Figure 1: Boxplots of the average prediction error and normalized mutual information over 100 repetitions in Experiment 1. Panels: (a) r = 1, (b) r = 0.8, (c) r = 0.5; methods shown: Oracle, OLS, bCARDS, aCARDS, TV, fLasso.

Table 3: Medians of the average prediction error (PE), number of false positives (FP) and NMI on the important variables, over 100 repetitions for Experiment 2.

                 Oracle   OracleO  OracleG  OLS      SCAD     sCARDS
PE    r = 1      1.0234   1.3869   1.0273   1.6758   1.4333   1.0895
      r = 0.7    1.0204   1.3961   1.0274   1.6544   1.4330   1.0960
FP    r = 1      0        0        40       40       5        1
      r = 0.7    0        0        40       40       4        2.5
NMI   r = 1      1.0000   0.5059   1.0000   0.5059   0.5059   1.0000
      r = 0.7    1.0000   0.5059   1.0000   0.5059   0.5059   1.0000

Figure 2: Boxplots of the average prediction errors over 100 repetitions in Experiment 2. Panels: (a) r = 1, (b) r = 0.7; methods shown: Oracle, OracleO, OracleG, OLS, SCAD, sCARDS.

Table 3 shows medians of the average prediction error, number of false positives and normalized mutual information on grouping important predictors. Figure 2 displays the boxplots of average prediction errors under different values of $r$. First, by comparing prediction errors of the three oracles, we see a significant advantage of taking into account both homogeneity and sparsity over pure sparsity. Moreover, the results of OracleO and OracleG show that, in this setting, exploring group structure is more important than sparsity. Second, sCARDS achieves a much smaller prediction error than that of OLS and SCAD. Third, compared to the preliminary estimator SCAD, sCARDS can further filter out falsely selected unimportant variables. Fourth, sCARDS successfully recovers the grouping structure on important variables in most cases (NMI = 1 means the estimated groups coincide exactly with the true ones).

Experiment 3: We consider a special case of the spatial-temporal model, where $X_{it} = X_t$ for $i = 1, \cdots, p$, i.e., the predictors are common for all spatial locations, and $p = 100$ is the total number of locations. Each $\beta_i$ is a 5-dimensional vector. In each coordinate $j = 1, \cdots, 5$, the coefficients $\{\beta_{ij} : 1 \le i \le 100\}$ are divided into four groups of equal size 25, with coefficients in the same group sharing the same value. In coordinate 1, the four true coefficients are $[-2, -1, 1, 2]$; in coordinate $j = 2, \cdots, 5$, they are $[-2, -1, 1, 2] + 0.1 \times (j - 1)$.

Table 4: Medians of the average prediction error and NMI over 100 repetitions for Experiment 3.

           Prediction Error              NMI
           Oracle   OLS      aCARDS      Oracle   OLS      aCARDS
T = 20     1.0095   1.2501   1.1898      1.0000   0.4628   0.8154
T = 50     1.0034   1.0990   1.0170      1.0000   0.4628   0.9803
T = 80     1.0025   1.0625   1.0067      1.0000   0.4628   0.9851

We extend aCARDS (bCARDS) to this model: given a preliminary estimator, for each coordinate $j = 1, \cdots, k$, extract the data-driven segments (ordering) and build the cross-sectional hybrid (fused) penalty $P_j(\cdot)$, then sum them up to build the penalty term, and finally solve a penalized maximum likelihood problem:

.. p T k 
/ 3=^..,^)-(b 1 ,..,b t )l2T^^ 1 < PJ ^ J 

We still call the method aCARDS (bCARDS). The Oracle is the maximum likelihood es- 
timator knowing the true groups in each coordinate. We aim to compare the performance 
of Oracle, OLS and aCARDS. 

Table 4 shows medians of the average prediction error and normalized mutual information (averaged over 5 coordinates). Instead of varying the signal-to-noise ratio directly, we equivalently change $T$, the total number of time points. Figure 3 contains the boxplots. We see that aCARDS achieves lower prediction errors than OLS in all cases. Moreover, aCARDS estimates the true grouping structure well; in particular, when $T = 50, 80$, NMI > 0.95 in most repetitions.

Figure 3: Boxplots of the results over 100 repetitions in Experiment 3.



6 Real data analysis

6.1 S&P500 returns

In this study, we fit a homogeneous Fama-French model for stock returns: $Y_{it} = \alpha_i + X_t^T\beta_i^0 + \epsilon_{it}$, where $X_t$ contains three Fama-French factors at time $t$ and $Y_{it}$ is the excess return of stock $i$. We collected daily returns of 410 stocks, which were components of the S&P500 index in the period December 1, 2010 to December 1, 2011 ($T = 254$). We applied bCARDS as in Experiment 3, except that the intercepts $\alpha_i$'s were also penalized. The tuning parameters were chosen via generalized cross validation (GCV). Table 5 shows the number of fitted coefficient groups on three factors and the number of non-zero intercepts. We then used the daily returns of those stocks in the period December 1, 2011 to July 2, 2012 ($T = 146$) to evaluate the estimation error. Let $\hat y_{it}$ and $y_{it}$ be the fitted and observed excess returns of stock $i$ at time $t = 1, \cdots, 146$, respectively. Define the cumulative sum of squared estimation errors at time $t$ as $\mathrm{cRSS}_t = \sum_{s=1}^{t}\rho^{(t-s)/10}\sum_i(\hat y_{is} - y_{is})^2$, where $\rho$ is a chosen constant between 0 and 1. Here we take $\rho = 0.95$. Figure 4 shows the percentage improvement in $\mathrm{cRSS}_t$ of the CARDS estimator over the OLS estimator. We see that CARDS achieves a smaller cumulative sum of squared estimation errors compared to OLS at most time points, especially in the "very-close" and "far-away" future. The North American Industry Classification System (NAICS) classifies these 410 companies into 18 different industry sectors. Figure 5(a) shows the OLS coefficients on the "book-to-market ratio" factor. We can see that stocks belonging to Sector 2 "Utilities" (29 stocks in total) have very close OLS coefficients, and 17 stocks in this sector were clustered into one group in the CARDS estimator. Figure 5(b) shows the percentage improvement in $\mathrm{cRSS}_t$ only for stocks in this sector, where the improvement is more significant.

Figure 4: Cumulative sum of squared estimation errors of the S&P500 data from December 1, 2011 to July 2, 2012. The vertical axis is $100(\mathrm{cRSS}_t^{OLS} - \mathrm{cRSS}_t^{bCARDS})/\mathrm{cRSS}_t^{OLS}$.

Figure 5: (a) OLS coefficients on the "book-to-market ratio" factor. The x axis represents different sectors. (b) Percentage improvement of the cumulative sum of squared estimation errors for stocks in Sector 2 "Utilities".

Table 5: Number of groups in fitting the S&P500 data.

Fama-French factors         No. of coef. groups
"market return"             41
"market capitalization"     32
"book-to-market ratio"      56
intercept                   60

6.2 Polyadenylation signals

Although we have focused on the linear regression setting so far, the proposed method can be easily extended to more general settings such as generalized linear models. In this subsection, we apply the proposed method to a logistic regression example. This study tried to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analyzing features around them. The data set was first used in Legendre and Gautheret [2003] and later analyzed by Liu et al. [2003], and it is available at http://datam.i2r.a-star.edu.sg/datasets/krbd/SequenceData/Polya.html. There is one training data set and five testing data sets. To avoid any platform bias, we use the training data set only. It has 4418 observations, each with 170 predictors and a binary response. The binary response indicates whether a terminal sequence is classified as a "strong" or "weak" polyA site, and the predictors are features from the upstream (USE) and downstream (DSE) sequence elements. We randomly select 2000 observations to perform model estimation and use the rest to evaluate performance. Our numerical analysis consists of the following steps. Step 1 is to apply the lasso penalized logistic regression to these 2000 observations with all 170 predictors and to use AIC to select an appropriate regularization parameter. In Step 2, we use the logistic regression coefficients obtained in Step 1 as our preliminary estimate and apply CARDS accordingly. The average prediction error (with standard error in parentheses) over 40 random splits is reported in Table 6. We also report the average number of non-zero coefficient groups and the average number of selected features. It shows that two versions of CARDS lead to a smaller prediction error when compared with the total variation penalty. In addition, aCARDS has fewer groups of non-zero coefficients but more selected features.



Table 6: Results of the PASes data.

                               aCARDS           bCARDS           TV
Prediction Error               0.2449 (.0015)   0.2485 (.0014)   0.2757 (.0026)
No. of non-zero coef. groups   5.5000           21.6250          5.7500
No. of selected features       73.2750          21.6250          40.3500

7 Conclusion

In this paper, we explored homogeneity of coefficients in high-dimensional regression. We proposed a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to estimate regression coefficients and to detect homogeneous groups. The implementation of CARDS does not need any geographical information (neighborhoods, distance, graphs, etc.) a priori, which distinguishes it from other methods in similar settings and makes it more generally applicable. A modification of CARDS, sCARDS, can be used to explore homogeneity and sparsity simultaneously. Our theoretical results show that by exploring homogeneity, better estimation accuracy can be achieved. In particular, when the number of homogeneous groups is small, the power of exploring homogeneity and sparsity simultaneously is much larger than that of exploring sparsity only, which is supported by our simulation studies.

To promote homogeneity, CARDS uses a preliminary estimate to construct data-driven penalties. This so-called "hybrid pairwise penalty" is built through a preliminary ranking $\tau$ and a parameter $\delta$ for segmentation. Such an idea of taking advantage of a preliminary estimate can be generalized. For example, we may apply clustering methods to these preliminary coefficients, such as the $k$-means algorithm or a hierarchical clustering algorithm, to help construct penalties and further promote homogeneity.

This paper only considers the case where predictors in one homogeneous group have equal coefficients. In a more general situation, coefficients of predictors in the same group are close but not exactly equal. The idea of data-driven pairwise penalties still applies, but instead of using the class of folded concave penalty functions, we may need to use penalty functions which are smooth at the origin, e.g., the $L_2$ penalty function. Another possible approach is to use posterior-type estimators combined with, say, a Gaussian prior on the coefficients. These are beyond the scope of this paper and we leave them as future work.
8 Proofs

8.1 Proof of Theorem 3.1

Introduce the mapping $T : \mathcal{M}_A \to \mathbb{R}^K$, where $T(\beta)$ is the $K$-dimensional vector whose $k$-th coordinate equals the common value of $\beta_j$ for $j \in A_k$. Note that $T$ is a bijection and $T^{-1}$ is well-defined for any $\mu \in \mathbb{R}^K$. Also, introduce the mapping $T^* : \mathbb{R}^p \to \mathbb{R}^K$, where $T^*(\beta)_k = \frac{1}{|A_k|}\sum_{j \in A_k}\beta_j$. We see that $T^* = T$ on $\mathcal{M}_A$, and $T^{-1} \circ T^*$ is the orthogonal projection from $\mathbb{R}^p$ to $\mathcal{M}_A$. Denote $\mu^0 = T(\beta^0)$ and $\hat\mu^{oracle} = T(\hat\beta^{oracle})$.
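The claim that averaging within groups and expanding back is the orthogonal projection onto the space of group-constant vectors can be verified numerically. The sketch below uses hypothetical function names and an arbitrary grouping chosen for illustration.

```python
import numpy as np

def T_star(beta, groups):
    """Group-wise averages: the k-th entry is the mean of beta over group k."""
    return np.array([beta[g].mean() for g in groups])

def T_inv(mu, groups, p):
    """Expand K group values to the p-vector that is constant on each group."""
    beta = np.empty(p)
    for k, g in enumerate(groups):
        beta[g] = mu[k]
    return beta

def project(beta, groups):
    """Composition of T_inv and T_star."""
    return T_inv(T_star(beta, groups), groups, len(beta))

rng = np.random.default_rng(3)
groups = [[0, 3, 5], [1, 2], [4, 6, 7, 8]]
p = 9
beta = rng.standard_normal(p)
beta_star = project(beta, groups)

# An arbitrary element of the group-constant space, used to test orthogonality.
other = T_inv(rng.standard_normal(len(groups)), groups, p)
```

The map is idempotent and its residual is orthogonal to every group-constant vector, which together characterize an orthogonal projection.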

Denote $L_n(\beta) = \frac{1}{2n}\|y - X\beta\|^2$ and $P_n(\beta) = \lambda_n\sum_{j=1}^{p-1}\rho(\beta_{\tau(j+1)} - \beta_{\tau(j)})$, so that we can write $Q_n(\beta) = L_n(\beta) + P_n(\beta)$. For any $\mu \in \mathbb{R}^K$, let

$$L_n^A(\mu) = \frac{1}{2n}\|y - X_A\mu\|^2, \qquad P_n^A(\mu) = \lambda_n\sum_{k=1}^{K-1}\rho(\mu_{k+1} - \mu_k),$$

and define $Q_n^A(\mu) = L_n^A(\mu) + P_n^A(\mu)$. Note that when $\tau$ preserves the order of $\beta^0$, there exist $1 = j_1 < j_2 < \cdots < j_K < j_{K+1} = p+1$ such that $A_k = \{\tau(j_k), \tau(j_k+1), \cdots, \tau(j_{k+1}-1)\}$ for $1 \le k \le K$. Then $Q_n(\beta) = Q_n^A(T(\beta))$ and $Q_n^A(\mu) = Q_n(T^{-1}(\mu))$ for any $\beta \in \mathcal{M}_A$ and $\mu \in \mathbb{R}^K$.

In the first part of the proof, we show $\|\hat\beta^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$. By definition and direct calculations,

$$\|\hat\beta^{oracle} - \beta^0\| = \|D(\hat\mu^{oracle} - \mu^0)\|, \qquad \hat\mu^{oracle} - \mu^0 = (X_A^TX_A)^{-1}X_A^T\epsilon.$$

Therefore, we can write

$$\|\hat\beta^{oracle} - \beta^0\| = \|(D^{-1}X_A^TX_AD^{-1})^{-1}D^{-1}X_A^T\epsilon\|.$$

From Condition 3.1, $\|(D^{-1}X_A^TX_AD^{-1})^{-1}\| \le (c_1n)^{-1}$ and $\mathrm{tr}(D^{-1}X_A^TX_AD^{-1}) \le c_2nK$. By the Markov inequality, for any $\delta > 0$,

$$P\Big(\|D^{-1}X_A^T\epsilon\| > \sqrt{c_2nK/\delta}\Big) \le \frac{E\|D^{-1}X_A^T\epsilon\|^2}{c_2nK/\delta} = \frac{\mathrm{tr}(D^{-1}X_A^TX_AD^{-1})}{c_2nK/\delta} \le \delta.$$

Combining the above, we have shown that with probability at least $1 - \delta$, $\|\hat\beta^{oracle} - \beta^0\| \le C\delta^{-1/2}\sqrt{K/n}$. This proves $\|\hat\beta^{oracle} - \beta^0\| = O_p(\sqrt{K/n})$.

Furthermore, we can write $D^{-1}X_A^T\epsilon = (v_1^T\epsilon, \cdots, v_K^T\epsilon)^T$, where $v_k = X_AD^{-1}e_k$ and $e_k$ is the unit vector with 1 on the $k$-th coordinate and 0 elsewhere. Note that $\|v_k\|^2 \le c_2n$. It follows from Condition 3.3 and the union bound that

$$P\Big(\|D^{-1}X_A^T\epsilon\|_\infty > \sqrt{c_2c_3^{-1}n\log(2n)}\Big) \le \sum_{k=1}^K P\Big(|v_k^T\epsilon| > \|v_k\|\sqrt{c_3^{-1}\log(2n)}\Big) \le n^{-1}K. \qquad (25)$$

Since $\|D^{-1}X_A^T\epsilon\| \le K^{1/2}\|D^{-1}X_A^T\epsilon\|_\infty$, we have

$$\|\hat\beta^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n}, \quad \text{with probability} \ge 1 - n^{-1}K. \qquad (26)$$

In the second part of the proof, we show that $\hat\beta^{oracle}$ is a strictly local minimum of $Q_n(\beta)$ with probability at least $1 - \epsilon_0 - n^{-1}K - 2p^{-1}$. By assumption, there is an event $E_1$ such that $P(E_1^c) \le \epsilon_0$ and over the event $E_1$, $\tau$ preserves the order of $\beta^0$. Consider the neighborhood of $\beta^0$:

$$\mathcal{B} = \Big\{\beta \in \mathbb{R}^p : \|\beta - \beta^0\| \le 2C\sqrt{K\log(n)/n}\Big\}.$$

By (26), there is an event $E_2$ such that $P(E_2^c) \le n^{-1}K$ and over the event $E_2$, $\|\hat\beta^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n}$. Hence, $\hat\beta^{oracle} \in \mathcal{B}$ over the event $E_2$. For any $\beta \in \mathcal{B}$, write $\beta^*$ as its orthogonal projection to $\mathcal{M}_A$. We aim to show

(a) Over the event $E_1 \cap E_2$,

$$Q_n(\beta^*) \ge Q_n(\hat\beta^{oracle}), \quad \text{for any } \beta \in \mathcal{B}, \qquad (27)$$

and the inequality is strict whenever $\beta^* \ne \hat\beta^{oracle}$.

(b) There is an event $E_3$ such that $P(E_3^c) \le 2p^{-1}$. Over the event $E_1 \cap E_2 \cap E_3$, there exists $\mathcal{B}_n$, a neighborhood of $\hat\beta^{oracle}$, such that

$$Q_n(\beta) \ge Q_n(\beta^*), \quad \text{for any } \beta \in \mathcal{B}_n, \qquad (28)$$

and the inequality is strict whenever $\beta \ne \beta^*$.

Combining (a) and (b), $Q_n(\beta) \ge Q_n(\hat\beta^{oracle})$ for any $\beta \in \mathcal{B}_n$, a neighborhood of $\hat\beta^{oracle}$, and the inequality is strict whenever $\beta \ne \hat\beta^{oracle}$. This proves that $\hat\beta^{oracle}$ is a strictly local minimum of $Q_n$ over the event $E_1 \cap E_2 \cap E_3$, and the claim follows immediately. Below we show (a) and (b).
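Consider (a) first. We claim

$$P_n^A(T^*(\beta)) = 0, \quad \text{for any } \beta \in \mathcal{B}. \qquad (29)$$

To see this, for a given $\beta \in \mathcal{B}$, write $\mu = T^*(\beta)$. It suffices to check $|\mu_{k+1} - \mu_k| > a\lambda_n$ for $k = 1, \cdots, K-1$. Note that $|\mu_{k+1} - \mu_k| \ge \min_{i \in A_k, j \in A_{k+1}}|\beta_i - \beta_j| \ge \min_{i \in A_k, j \in A_{k+1}}|\beta^0_i - \beta^0_j| - 2\|\beta - \beta^0\|_\infty \ge 2b_n - 2C\sqrt{K\log(n)/n}$. Since $b_n > a\lambda_n \gg \sqrt{K\log(n)/n}$, it is easy to see that $|\mu_{k+1} - \mu_k| > a\lambda_n$.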



Using (29), we see that $Q_n^A(T^*(\beta)) = L_n^A(T^*(\beta))$, for all $\beta \in \mathcal{B}$. By definition and the fact that $\frac{\partial^2 L_n^A(\mu)}{\partial\mu\,\partial\mu^T} = \frac{1}{n}X_A^TX_A$ is positive definite, $\hat\mu^{oracle}$ is the unique global minimum of $L_n^A(\mu)$. As a result, $L_n^A(T^*(\beta)) \ge L_n^A(\hat\mu^{oracle}) = L_n(\hat\beta^{oracle})$, and the inequality is strict for any $T^*(\beta) \ne \hat\mu^{oracle}$. Note that $Q_n^A = Q_n \circ T^{-1}$ and $T^{-1} \circ T^*$ is the orthogonal projection from $\mathbb{R}^p$ to $\mathcal{M}_A$. Combining the above, for any $\beta \in \mathcal{B}$,

$$Q_n(\beta^*) = Q_n(T^{-1} \circ T^*(\beta)) = Q_n^A(T^*(\beta)) = L_n^A(T^*(\beta)) \ge L_n(\hat\beta^{oracle}),$$

and the inequality is strict whenever $T^*(\beta) \ne \hat\mu^{oracle}$, i.e., $\beta^* \ne T^{-1}(\hat\mu^{oracle}) = \hat\beta^{oracle}$. This proves (27).



Second, consider (b). For a positive sequence $t_n$ to be determined, let

$$\mathcal{B}_n = \mathcal{B} \cap \{\beta : \|\beta - \hat\beta^{oracle}\| \le t_n\}.$$

Since $\beta^*$ is the orthogonal projection of $\beta$ to $\mathcal{M}_A$, $\|\beta - \beta^*\| \le \|\beta - \beta'\|$ for any $\beta' \in \mathcal{M}_A$. In particular, $\|\beta - \beta^*\| \le \|\beta - \hat\beta^{oracle}\|$. As a result, to show (28), it suffices to show

$$Q_n(\beta) \ge Q_n(\beta^*), \quad \text{for any } \beta \in \mathcal{B} \text{ such that } \|\beta - \beta^*\| \le t_n, \qquad (30)$$

and the inequality is strict whenever $\beta \ne \beta^*$.



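To show (30), write $\mu = T^*(\beta)$ so that $\beta^* = T^{-1}(\mu)$. By Taylor expansion,

$$Q_n(\beta) - Q_n(\beta^*) = -\frac{1}{n}(y - X\beta^m)^T X(\beta - \beta^*) + \sum_{j=1}^{p} \frac{\partial P_n(\beta^m)}{\partial \beta_{\tau(j)}}\,(\beta_{\tau(j)} - \beta^*_{\tau(j)}) \equiv I_3 + I_4,$$

where $\beta^m$ is in the line between $\beta$ and $\beta^*$. Consider $I_4$ first. Direct calculations yield

$$\frac{\partial P_n(\beta)}{\partial \beta_{\tau(j)}} = \begin{cases} -\lambda_n \bar\rho(\beta_{\tau(2)} - \beta_{\tau(1)}), & j = 1, \\ \lambda_n \bar\rho(\beta_{\tau(j)} - \beta_{\tau(j-1)}) - \lambda_n \bar\rho(\beta_{\tau(j+1)} - \beta_{\tau(j)}), & 2 \le j \le p-1, \\ \lambda_n \bar\rho(\beta_{\tau(p)} - \beta_{\tau(p-1)}), & j = p, \end{cases}$$

where $\bar\rho(t) = \rho'(|t|)\,\mathrm{sgn}(t)$ and $\rho(t) = \lambda^{-1}p_\lambda(t)$. Plugging it into $I_4$ and rearranging the sum, we obtain

$$I_4 = \lambda_n \sum_{j=1}^{p-1} \bar\rho(\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}) \left[ (\beta_{\tau(j+1)} - \beta_{\tau(j)}) - (\beta^*_{\tau(j+1)} - \beta^*_{\tau(j)}) \right]. \qquad (31)$$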
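Note that when $\tau(j)$ and $\tau(j+1)$ belong to the same group, $\beta^*_{\tau(j)} = \beta^*_{\tau(j+1)}$, and hence the sign of $(\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)})$ is the same as the sign of $(\beta_{\tau(j+1)} - \beta_{\tau(j)})$ if neither of them is 0. In addition, recall that $A_k = \{\tau(j_k), \tau(j_k+1), \dots, \tau(j_{k+1}-1)\}$ for all $1 \le k \le K$, for some indices $1 = j_1 < j_2 < \cdots < j_{K+1} = p+1$. Combining the above, we can rewrite

$$I_4 = \lambda_n \sum_{k=1}^{K} \sum_{j=j_k}^{j_{k+1}-2} \rho'(|\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}|)\,|\beta_{\tau(j+1)} - \beta_{\tau(j)}| + \lambda_n \sum_{k=2}^{K} \bar\rho(\beta^m_{\tau(j_k)} - \beta^m_{\tau(j_k-1)}) \left[ (\beta_{\tau(j_k)} - \beta_{\tau(j_k-1)}) - (\beta^*_{\tau(j_k)} - \beta^*_{\tau(j_k-1)}) \right].$$

First, since $\beta^0 \in M_A$ and $\beta^*$ is the orthogonal projection of $\beta$ to $M_A$, $\|\beta^* - \beta^0\| \le \|\beta - \beta^0\|$. Hence, $\beta \in B$ implies $\beta^*, \beta^m \in B$. By repeating the proof of (29), we can show $\rho'(|\beta^m_{\tau(j_k)} - \beta^m_{\tau(j_k-1)}|) = 0$ for $2 \le k \le K$. So the second term in $I_4$ disappears. Second, in the first term of $I_4$, since $|\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}| \le 2\|\beta^m - \beta^*\|_\infty \le 2\|\beta - \beta^*\| \le 2t_n$, it follows by concavity that $\rho'(|\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}|) \ge \rho'(2t_n)$. Together, we have

$$I_4 \ge \lambda_n \sum_{k=1}^{K} \sum_{j=j_k}^{j_{k+1}-2} \rho'(2t_n)\,|\beta_{\tau(j+1)} - \beta_{\tau(j)}|. \qquad (32)$$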

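Next, we simplify $I_3$. Denote $z = z(\beta^m) = X^T(y - X\beta^m)$ and write $I_3 = -\frac{1}{n} z^T(\beta - \beta^*)$. For any fixed $k$ and $l$ such that $\tau(l) \in A_k$ and $l \ne j_{k+1} - 1$, let $A^1_{kl} = \{\tau(j) \in A_k : j \le l\}$ and $A^2_{kl} = \{\tau(j) \in A_k : j > l\}$. Regarding that $\beta^*_{\tau(i)} = \frac{1}{|A_k|}\sum_{j=j_k}^{j_{k+1}-1} \beta_{\tau(j)}$ for $\tau(i) \in A_k$, we can reexpress $I_3$ as

$$\begin{aligned} I_3 &= -\frac{1}{n}\sum_{k=1}^{K} \sum_{j=j_k}^{j_{k+1}-1} z_{\tau(j)}\,[\beta_{\tau(j)} - \beta^*_{\tau(j)}] \\ &= -\frac{1}{n}\sum_{k=1}^{K} \frac{1}{|A_k|} \sum_{j=j_k}^{j_{k+1}-1}\sum_{i=j_k}^{j_{k+1}-1} z_{\tau(j)}\,[\beta_{\tau(j)} - \beta_{\tau(i)}] \\ &= -\frac{1}{n}\sum_{k=1}^{K} \frac{1}{|A_k|} \sum_{j_k \le i < j \le j_{k+1}-1} [z_{\tau(j)} - z_{\tau(i)}]\,[\beta_{\tau(j)} - \beta_{\tau(i)}] \\ &= -\frac{1}{n}\sum_{k=1}^{K} \frac{1}{|A_k|} \sum_{j_k \le i < j \le j_{k+1}-1} [z_{\tau(j)} - z_{\tau(i)}] \sum_{i \le l < j} [\beta_{\tau(l+1)} - \beta_{\tau(l)}] \\ &= \sum_{k=1}^{K} \sum_{l=j_k}^{j_{k+1}-2} w_{\tau(l)}(z)\,[\beta_{\tau(l+1)} - \beta_{\tau(l)}], \qquad (33) \end{aligned}$$

where for any vector $v \in \mathbb{R}^p$,

$$w_{\tau(l)}(v) = n^{-1}\left[ \frac{|A^2_{kl}|}{|A_k|} \sum_{j \in A^1_{kl}} v_j - \frac{|A^1_{kl}|}{|A_k|} \sum_{j \in A^2_{kl}} v_j \right].$$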
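The rearrangement leading to (33) can be checked on a small example. The sketch below is only an illustration under simplifying assumptions: the ordering $\tau$ is taken to be the identity, each group is a contiguous block of indices, and the numbers in `z` and `beta` are arbitrary. It compares $-\frac{1}{n} z^T(\beta - \beta^*)$ with the sum of $w_l(z)$ times consecutive differences inside each group.

```python
def w(z, group, l, n):
    """Weight w_l(z) for a split after index l inside a contiguous group."""
    a1 = [j for j in group if j <= l]      # indices up to and including l
    a2 = [j for j in group if j > l]       # indices after l
    size = len(group)
    return (len(a2) / size * sum(z[j] for j in a1)
            - len(a1) / size * sum(z[j] for j in a2)) / n


def direct_form(z, beta, groups, n):
    """-(1/n) * sum_j z_j * (beta_j - group mean of beta)."""
    total = 0.0
    for g in groups:
        mean = sum(beta[j] for j in g) / len(g)
        total += sum(z[j] * (beta[j] - mean) for j in g)
    return -total / n


def difference_form(z, beta, groups, n):
    """sum over groups and within-group splits of w_l(z) * (beta_{l+1} - beta_l)."""
    return sum(w(z, g, l, n) * (beta[l + 1] - beta[l])
               for g in groups for l in g[:-1])


n = 10
groups = [[0, 1, 2], [3, 4]]               # hypothetical contiguous groups
z = [0.5, -1.0, 2.0, 0.3, -0.7]
beta = [1.0, 1.2, 1.5, 3.0, 3.4]

lhs = direct_form(z, beta, groups, n)
rhs = difference_form(z, beta, groups, n)
assert abs(lhs - rhs) < 1e-12
```

Writing $I_3$ in terms of consecutive differences is what allows it to be compared term by term with the lower bound (32) on $I_4$.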

We aim to bound $|w_{\tau(l)}(z)|$. Denote $\eta = X^TX(\beta^* - \beta^0)$, $\eta^m = X^TX(\beta^m - \beta^*)$ and write $z = X^T\varepsilon - \eta - \eta^m$. First, $w_{\tau(l)}(v)$ is a linear function of $v$. Second, since $\beta^m$ lies between $\beta$ and $\beta^*$, we have $\|\beta^* - \beta^m\| \le \|\beta^* - \beta\| \le t_n$. It follows that $\|\eta^m\| \le \lambda_{\max}(X^TX)\,t_n$. Moreover, $|w_{\tau(l)}(v)| \le (|A_k|/n)\|v\|_\infty \le (p/n)\|v\|$ for any $v$. Combining the above yields

$$\begin{aligned} |w_{\tau(l)}(z)| &\le |w_{\tau(l)}(X^T\varepsilon)| + |w_{\tau(l)}(\eta)| + \sup_{v:\ \|v\| \le \lambda_{\max}(X^TX)t_n} |w_{\tau(l)}(v)| \\ &\le |w_{\tau(l)}(X^T\varepsilon)| + |w_{\tau(l)}(\eta)| + (p/n)\,\lambda_{\max}(X^TX)\,t_n. \qquad (34) \end{aligned}$$
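First, we bound the term $w_{\tau(l)}(X^T\varepsilon)$. Let $E_3$ be the event that

$$\max_{\tau(l) \in A_k} |w_{\tau(l)}(X^T\varepsilon)| \le n^{-1/2}\sqrt{2\sigma_k|A_k|\log(p)/c_3}, \qquad k = 1, \dots, K, \qquad (35)$$

where we recall $\sigma_k$ is the maximum eigenvalue of $n^{-1}X^TX$ restricted to the $(A_k, A_k)$-block. Given $\tau(l)$, we can express $w_{\tau(l)}(X^T\varepsilon)$ as

$$w_{\tau(l)}(X^T\varepsilon) = a_{\tau(l)}^T\varepsilon, \quad \text{where } a_{\tau(l)} = \frac{1}{n}\left( \frac{|A^2_{kl}|}{|A_k|} X_{A^1_{kl}} 1_{A^1_{kl}} - \frac{|A^1_{kl}|}{|A_k|} X_{A^2_{kl}} 1_{A^2_{kl}} \right).$$

Write $L_1 = |A^1_{kl}|$ and $L_2 = |A^2_{kl}|$, so that $|A_k| = L_1 + L_2$. It is observed that $\|X_{A^1_{kl}} 1_{A^1_{kl}}\|^2 \le n\sigma_k \|1_{A^1_{kl}}\|^2 \le n\sigma_k L_1$. Using the fact that $(a+b)^2 \le 2(a^2+b^2)$ for any real values $a, b$, we have $\|a_{\tau(l)}\|^2 \le 2n^{-1}\sigma_k (L_2^2L_1/|A_k|^2 + L_1^2L_2/|A_k|^2) = 2\sigma_k L_1L_2/(n|A_k|) \le \sigma_k|A_k|/(2n)$. Applying Condition 3.3 and the probability union bound,

$$\begin{aligned} P(E_3^c) &\le \sum_{k=1}^{K} \sum_{\tau(l) \in A_k} P\left( |w_{\tau(l)}(X^T\varepsilon)| > n^{-1/2}\sqrt{2\sigma_k|A_k|\log(p)/c_3} \right) \\ &\le \sum_{1 \le j \le p} P\left( |a_j^T\varepsilon| > \|a_j\|\sqrt{2\log(p)/c_3} \right) \le 2p^{-1}. \qquad (36) \end{aligned}$$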

Second, we bound the term $w_{\tau(l)}(\eta)$. Observing that for any vector $v$, $w_{\tau(l)}(v) = w_{\tau(l)}(v - \bar v_k 1)$, where $\bar v_k$ is the mean of $\{v_j, j \in A_k\}$, we have

$$|w_{\tau(l)}(v)| \le \frac{2|A^1_{kl}||A^2_{kl}|}{n|A_k|} \max_{j \in A_k} |v_j - \bar v_k| \le \frac{|A_k|}{2n} \max_{j \in A_k} |v_j - \bar v_k|.$$

Since $\eta = X^TX(\beta^* - \beta^0)$ and $\beta^* - \beta^0 \in M_A$, we have $\max_{j \in A_k} |\eta_j - \bar\eta_k| \le n\nu_k\|\beta^* - \beta^0\|$. As a result,

$$\max_{\tau(l) \in A_k} |w_{\tau(l)}(\eta)| \le (\nu_k/2)|A_k| \cdot \|\beta^* - \beta^0\| \le C\nu_k|A_k|\sqrt{K\log(n)/n}. \qquad (37)$$
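Combining (33)-(37), we find that over the event $E_1 \cap E_2 \cap E_3$,

$$\begin{aligned} |I_3| &\le \sum_{k=1}^{K} \sum_{l=j_k}^{j_{k+1}-2} \left[ C\left( \sqrt{\frac{\sigma_k|A_k|\log(p)}{n}} + \nu_k|A_k|\sqrt{\frac{K\log(n)}{n}} \right) + \frac{p\lambda_{\max}(X^TX)}{n}\,t_n \right] |\beta_{\tau(l+1)} - \beta_{\tau(l)}| \\ &\le \sum_{k=1}^{K} \sum_{l=j_k}^{j_{k+1}-2} \left[ \frac{\lambda_n}{2} + \frac{p\lambda_{\max}(X^TX)}{n}\,t_n \right] |\beta_{\tau(l+1)} - \beta_{\tau(l)}|, \qquad (38) \end{aligned}$$

where we have used the fact $\lambda_n \gg \max_k\{\sqrt{\sigma_k|A_k|\log(p)/n} + \nu_k|A_k|\sqrt{K\log(n)/n}\}$.

From (32) and (38), over the event $E_1 \cap E_2 \cap E_3$,

$$\inf_{\beta \in B:\ \|\beta - \beta^*\| \le t_n} [Q_n(\beta) - Q_n(\beta^*)] \ge \sum_{k=1}^{K} \sum_{l=j_k}^{j_{k+1}-2} \left[ \frac{\lambda_n}{2} - g_n(t_n) \right] |\beta_{\tau(l+1)} - \beta_{\tau(l)}|,$$

where $g_n(t_n) = n^{-1}p\lambda_{\max}(X^TX)\,t_n + \lambda_n[1 - \rho'(2t_n)]$. Since $\rho'(0+) = 1$, $g_n(0+) = 0$. So we can always choose $t_n$ sufficiently small to make sure $|g_n(t_n)| \le \lambda_n/2$; consequently the right hand side is non-negative, and strictly positive when $\sum_{k=1}^{K}\sum_{l=j_k}^{j_{k+1}-2} |\beta_{\tau(l+1)} - \beta_{\tau(l)}| > 0$, i.e., $\beta \ne \beta^*$. This proves (28). □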



8.2 Proof of Theorem 3.2



First, we show that the LLA algorithm yields $\hat\beta^{oracle}$ after one iteration. Let $E_1$ be the event that $\tau$ preserves the order of $\beta^0$, $E_2$ the event that $\|\hat\beta^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n}$ and $E_3$ the event that (35) holds. We have shown that $P(E_1 \cap E_2 \cap E_3) \ge 1 - \epsilon_0 - n^{-1}K - 2p^{-1}$. It suffices to show that over the event $E_1 \cap E_2 \cap E_3$, the LLA algorithm gives $\hat\beta^{oracle}$ after the first iteration.

Let $w_j = \rho'(|\hat\beta^{initial}_{\tau(j+1)} - \hat\beta^{initial}_{\tau(j)}|)$. At the first iteration, the algorithm minimizes

$$Q_n^{initial}(\beta) = \frac{1}{2n}\|y - X\beta\|^2 + \lambda_n \sum_{j=1}^{p-1} w_j\,|\beta_{\tau(j+1)} - \beta_{\tau(j)}|.$$

This is a convex function, hence it suffices to show that $\hat\beta^{oracle}$ is a strictly local minimum of $Q_n^{initial}$. Using the same notations as in the proof of Theorem 3.1, for any $\beta \in \mathbb{R}^p$, write $\beta^* = T^{-1} \circ T^*(\beta)$ as its orthogonal projection to $M_A$. Let $B = \{\beta \in \mathbb{R}^p : \|\beta - \beta^0\| \le C\sqrt{K\log(n)/n}\}$, and for a sequence $\{t_n\}$ to be determined, consider the neighborhood of $\hat\beta^{oracle}$ defined by $B_n = \{\beta \in B : \|\beta - \hat\beta^{oracle}\| \le t_n\}$. It suffices to show

$$Q_n^{initial}(\beta) \ge Q_n^{initial}(\beta^*) \ge Q_n^{initial}(\hat\beta^{oracle}), \quad \text{for any } \beta \in B_n, \qquad (39)$$

and the first inequality is strict whenever $\beta \ne \beta^*$, and the second inequality is also strict whenever $\beta^* \ne \hat\beta^{oracle}$.



We first show the second inequality in (39). For $\tau(j)$ and $\tau(j+1)$ in different groups, $|\beta^0_{\tau(j+1)} - \beta^0_{\tau(j)}| \ge 2b_n$; also, $\|\hat\beta^{initial} - \beta^0\|_\infty \le \lambda_n/2 < b_n$. Hence, $|\hat\beta^{initial}_{\tau(j+1)} - \hat\beta^{initial}_{\tau(j)}| \ge 2b_n - \lambda_n > a\lambda_n$, and it follows that $w_j = 0$. On the other hand, for $\tau(j)$ and $\tau(j+1)$ in the same group, $\beta_{\tau(j+1)} - \beta_{\tau(j)} = 0$ whenever $\beta \in M_A$. Consequently,

$$Q_n^{initial}(\beta) = \frac{1}{2n}\|y - X\beta\|^2 = L_n(\beta), \quad \text{for } \beta \in M_A.$$

We have seen in the proof of Theorem 3.1 that $\hat\beta^{oracle}$ is the unique global minimum of $L_n$ constrained on $M_A$. So the second inequality in (39) holds.



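Next, consider the first inequality in (39). By applying Taylor expansion and rearranging terms, for some $\beta^m$ that lies in the line between $\beta$ and $\beta^*$,

$$\begin{aligned} Q_n^{initial}(\beta) - Q_n^{initial}(\beta^*) &= \lambda_n \sum_{j=1}^{p-1} w_j \cdot \mathrm{sgn}(\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)}) \left[ (\beta_{\tau(j+1)} - \beta_{\tau(j)}) - (\beta^*_{\tau(j+1)} - \beta^*_{\tau(j)}) \right] \\ &\quad - \frac{1}{n}(y - X\beta^m)^TX(\beta - \beta^*) \equiv J_1 + J_2. \end{aligned}$$

We first simplify $J_1$. Note that $w_j = 0$ when $\tau(j)$ and $\tau(j+1)$ are in different groups. When $\tau(j)$ and $\tau(j+1)$ are in the same $A_k$, first, $\beta^*_{\tau(j+1)} = \beta^*_{\tau(j)}$, and $(\beta^m_{\tau(j+1)} - \beta^m_{\tau(j)})$ has the same sign as $(\beta_{\tau(j+1)} - \beta_{\tau(j)})$; second, $|\hat\beta^{initial}_{\tau(j+1)} - \hat\beta^{initial}_{\tau(j)}| \le 2\|\hat\beta^{initial} - \beta^0\|_\infty \le \lambda_n$, and hence $w_j \ge \rho'(\lambda_n) \ge a_0$. Combining the above yields

$$J_1 = \lambda_n \sum_{k=1}^{K} \sum_{j=j_k}^{j_{k+1}-2} w_j\,|\beta_{\tau(j+1)} - \beta_{\tau(j)}| \ge a_0\lambda_n \sum_{k=1}^{K} \sum_{j=j_k}^{j_{k+1}-2} |\beta_{\tau(j+1)} - \beta_{\tau(j)}|. \qquad (40)$$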



Next, we simplify $J_2$. Denote $z = X^T(y - X\beta^m)$. Similarly to (33), we find that

$$J_2 = \sum_{k=1}^{K} \sum_{l=j_k}^{j_{k+1}-2} w_{\tau(l)}(z)\,[\beta_{\tau(l+1)} - \beta_{\tau(l)}],$$

where over the event $E_3$, for any $j_k \le l \le j_{k+1}-2$,

$$|w_{\tau(l)}(z)| \le \sqrt{\frac{2\sigma_k|A_k|\log(p)}{c_3 n}} + C\nu_k|A_k|\sqrt{\frac{K\log(n)}{n}} + \frac{p\lambda_{\max}(X^TX)}{n}\,t_n.$$

By the choice of $\lambda_n$, the sum of the first two terms is upper bounded by $a_0\lambda_n/3$ for large $n$; in addition, we choose $t_n = a_0 n\lambda_n/(3p\lambda_{\max}(X^TX))$. It follows that

$$|J_2| \le \frac{2a_0\lambda_n}{3} \sum_{k=1}^{K} \sum_{l=j_k}^{j_{k+1}-2} |\beta_{\tau(l+1)} - \beta_{\tau(l)}|. \qquad (41)$$
Combining (40) and (41), over the event $E_1 \cap E_2 \cap E_3$,

$$Q_n^{initial}(\beta) - Q_n^{initial}(\beta^*) \ge \frac{a_0\lambda_n}{3} \sum_{k=1}^{K} \sum_{j=j_k}^{j_{k+1}-2} |\beta_{\tau(j+1)} - \beta_{\tau(j)}| \ge 0.$$

This proves the first inequality in (39).



Second, we show that over the event $E_1 \cap E_2 \cap E_3$, at the second iteration, the LLA algorithm still yields $\hat\beta^{oracle}$ and therefore it converges to $\hat\beta^{oracle}$. We have shown that after the first iteration, the algorithm outputs $\hat\beta^{oracle}$. It then treats $\hat\beta^{oracle}$ as the initial solution for the second iteration. So it suffices to check

$$\|\hat\beta^{oracle} - \beta^0\|_\infty \le \lambda_n/2.$$

This is true because over the event $E_2$, $\|\hat\beta^{oracle} - \beta^0\| \le C\sqrt{K\log(n)/n} \ll \lambda_n$. □
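The role of the weights $w_j$ in the first LLA iteration can be sketched in code. The example below is a minimal illustration that assumes the SCAD penalty of Fan and Li with $a = 3.7$, for which $\rho'(t) = 1$ on $[0, \lambda]$, decays linearly on $(\lambda, a\lambda)$, and vanishes for $t \ge a\lambda$. The initial estimate `beta_init`, the value of `lam`, and the function names are hypothetical.

```python
def scad_rho_prime(t, lam, a=3.7):
    """Derivative of rho(t) = p_lambda(t) / lambda for the SCAD penalty (assumed here)."""
    t = abs(t)
    if t <= lam:
        return 1.0
    return max(a * lam - t, 0.0) / ((a - 1.0) * lam)


def lla_weights(beta_init, lam, a=3.7):
    """Weights w_j = rho'(|gap_j|) for consecutive gaps of the sorted initial estimate."""
    order = sorted(range(len(beta_init)), key=lambda j: beta_init[j])
    gaps = [beta_init[order[j + 1]] - beta_init[order[j]]
            for j in range(len(order) - 1)]
    return order, [scad_rho_prime(g, lam, a) for g in gaps]


beta_init = [0.02, -0.01, 0.0, 2.05, 1.98]   # two clusters, near 0 and near 2
lam = 0.1
order, weights = lla_weights(beta_init, lam)

# Gaps inside a cluster are at most lam, so they keep full weight 1;
# the gap between the clusters exceeds a * lam, so its weight is 0.
assert weights == [1.0, 1.0, 0.0, 1.0]
```

With these weights, the weighted fused penalty shrinks only differences within a cluster, which is the mechanism the proof uses to show that the first iteration already returns the oracle estimator.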

8.3 Proof of Theorem 3.3

Since $\tau$ is consistent with $\beta^0$, there exists $1 = j_1 < j_2 < \cdots < j_{K+1} = p+1$ such that $A_k = \{\tau(j_k), \tau(j_k+1), \dots, \tau(j_{k+1}-1)\}$ for all $k$. We shall write $\tau(j) = j$ without loss of generality.

In the first part of the proof, we show that $\hat\beta \in M_A$, and it satisfies the sign restrictions $\mathrm{sgn}(\hat\beta_{A,k+1} - \hat\beta_{A,k}) = \mathrm{sgn}(\beta^0_{A,k+1} - \beta^0_{A,k})$, $k = 1, \dots, K-1$.

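When $\rho(t) = |t|$, $Q_n(\beta)$ is strictly convex. So $\hat\beta$ is the unique global minimum if and only if it satisfies the first-order conditions:

$$0 = \begin{cases} -\frac{1}{n}x_1^T\varepsilon + \frac{1}{n}x_1^TX(\hat\beta - \beta^0) - \lambda_n\,\mathrm{sgn}(\hat\beta_2 - \hat\beta_1), & j = 1, \\ -\frac{1}{n}x_j^T\varepsilon + \frac{1}{n}x_j^TX(\hat\beta - \beta^0) + \lambda_n\,\mathrm{sgn}(\hat\beta_j - \hat\beta_{j-1}) - \lambda_n\,\mathrm{sgn}(\hat\beta_{j+1} - \hat\beta_j), & 2 \le j \le p-1, \\ -\frac{1}{n}x_p^T\varepsilon + \frac{1}{n}x_p^TX(\hat\beta - \beta^0) + \lambda_n\,\mathrm{sgn}(\hat\beta_p - \hat\beta_{p-1}), & j = p, \end{cases}$$

where $\mathrm{sgn}(t) = 1$ when $t > 0$, $-1$ when $t < 0$, and any value in $[-1, 1]$ when $t = 0$. Therefore, it suffices to show there exists $\hat\beta \in M_A$ that satisfies the sign restrictions and the first-order conditions simultaneously.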

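For $\hat\beta \in M_A$, we write $\hat\mu = T(\hat\beta)$ and $\mu^0 = T(\beta^0)$, where the mapping $T$ is the same as that in the proof of Theorem 3.1. The sign restrictions now become $\mathrm{sgn}(\hat\mu_{k+1} - \hat\mu_k) = \mathrm{sgn}(\mu^0_{k+1} - \mu^0_k)$ for all $k = 1, \dots, K-1$. Note that $\hat\beta_j = \hat\beta_{j+1}$ when predictors $j$ and $(j+1)$ belong to the same group in $A$. The first-order conditions can be re-expressed as

$$0 = \begin{cases} -\frac{1}{n}x_j^T\varepsilon + \frac{1}{n}x_j^TX_A(\hat\mu - \mu^0) + \lambda_n\,\mathrm{sgn}(\hat\mu_k - \hat\mu_{k-1}) - \lambda_n r_j, & j = j_k, \\ -\frac{1}{n}x_j^T\varepsilon + \frac{1}{n}x_j^TX_A(\hat\mu - \mu^0) + \lambda_n r_{j-1} - \lambda_n\,\mathrm{sgn}(\hat\mu_{k+1} - \hat\mu_k), & j = j_{k+1}-1, \\ -\frac{1}{n}x_j^T\varepsilon + \frac{1}{n}x_j^TX_A(\hat\mu - \mu^0) + \lambda_n r_{j-1} - \lambda_n r_j, & \text{elsewhere}, \end{cases} \qquad (42)$$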



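where $r_j$'s take any values on $[-1, 1]$ and we set $\mathrm{sgn}(\hat\mu_1 - \hat\mu_0) = \mathrm{sgn}(\hat\mu_{K+1} - \hat\mu_K) = 0$ by default. Denote by $\delta^0_k = \mathrm{sgn}(\mu^0_{k+1} - \mu^0_k)$ when $1 \le k \le K-1$ and $\delta^0_k = 0$ when $k = 0, K$; similarly, $\hat\delta_k$ for $0 \le k \le K$. In (42), we first remove $r_j$'s by summing up the equations corresponding to indices in each $A_k$. Using the fact that $x_{A,k} = \sum_{j \in A_k} x_j$, we obtain

$$-\frac{1}{n}x_{A,k}^T\varepsilon + \frac{1}{n}x_{A,k}^TX_A(\hat\mu - \mu^0) + \lambda_n\hat\delta_{k-1} - \lambda_n\hat\delta_k = 0, \qquad k = 1, \dots, K.$$

Under the sign restrictions $\hat\delta_k = \delta^0_k$, $k = 1, \dots, K-1$, it becomes a pure linear equation of $\hat\mu$:

$$-\frac{1}{n}X_A^T\varepsilon + \frac{1}{n}X_A^TX_A(\hat\mu - \mu^0) + \lambda_n d^0 = 0,$$

where $d^0$ is the $K$-dimensional vector with $d^0_k = \delta^0_{k-1} - \delta^0_k$, as defined in Section 3.2. It follows immediately that

$$\hat\mu = \mu^0 + (X_A^TX_A)^{-1}X_A^T\varepsilon - n\lambda_n (X_A^TX_A)^{-1}d^0. \qquad (43)$$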



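Second, given $(\hat\mu - \mu^0)$, (42) can be viewed as equations of $r_j$'s and we can solve them directly. Denote $\theta = \frac{1}{n}X^TX_A(\hat\mu - \mu^0) - \frac{1}{n}X^T\varepsilon$. For each $j \in A_k$, define $A^1_{kj} = \{j_k, \dots, j\}$ and $A^2_{kj} = \{j+1, \dots, j_{k+1}-1\}$. The solutions of (42) are

$$r_j = \hat\delta_{k-1} + \lambda_n^{-1} \sum_{i \in A^1_{kj}} \theta_i = \hat\delta_k - \lambda_n^{-1} \sum_{i \in A^2_{kj}} \theta_i, \qquad j \in A_k.$$

Here the two expressions of $r_j$ are equivalent because $\lambda_n^{-1}\sum_{i \in A_k} \theta_i = \hat\delta_k - \hat\delta_{k-1}$ from (42). It follows that any convex combination of the two expressions is also an equivalent expression of $r_j$. Taking the combination coefficients as $|A^2_{kj}|/|A_k|$ and $|A^1_{kj}|/|A_k|$, and plugging in the sign restrictions $\hat\delta_k = \delta^0_k$, $k = 1, \dots, K-1$, we obtain

$$r_j = \frac{n}{\lambda_n}\,w_j(\theta) + \frac{|A^2_{kj}|}{|A_k|}\,\delta^0_{k-1} + \frac{|A^1_{kj}|}{|A_k|}\,\delta^0_k, \qquad w_j(\theta) = n^{-1}\left[ \frac{|A^2_{kj}|}{|A_k|} \sum_{i \in A^1_{kj}} \theta_i - \frac{|A^1_{kj}|}{|A_k|} \sum_{i \in A^2_{kj}} \theta_i \right],$$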

where the function $w_j(\cdot)$ is defined as in (33). Here $r_j$'s still depend on $(\hat\mu - \mu^0)$ through $\theta$. Combining (43) to the definition of $\theta$ gives



o = -^x T [i - x A (x5x A )- 1 x5] e + A^x^x^x^-M 



±-X T P A e + \ n b°, 
where $P_A = I - X_A(X_A^TX_A)^{-1}X_A^T$ and $b^0$ is defined as in Section 3.2. By plugging in the expression of $\theta$, we can remove the dependence on $(\hat\mu - \mu^0)$ of the solutions of $r_j$'s:



-X^WjCX^Ae) + n Wj (b°) + 



\Al 



\A 1 I 



A — 1 + |^ | 



(44) 



Now, to show the existence of $\hat\beta \in M_A$ that satisfies both the sign restrictions and the first-order conditions, it suffices to show with probability at least $1 - \epsilon_0 - n^{-1}K - 2p^{-1}$,

(a) the $r_j$'s in (44) take values on $[-1, 1]$;

(b) the $\hat\mu$ in (42) satisfy the sign restrictions, i.e., $\mathrm{sgn}(\hat\mu_{k+1} - \hat\mu_k) = \mathrm{sgn}(\mu^0_{k+1} - \mu^0_k)$ for all $k = 1, \dots, K-1$.



Consider (a) first. In (44), under the "irrepresentability" condition, the sum of the last two terms is bounded by $(1 - \omega_n)$ in magnitude. To deal with the first term, recall that in deriving (35), we write $w_j(X^T\varepsilon) = a_j^T\varepsilon$. It follows immediately that $w_j(X^TP_A\varepsilon) = a_j^TP_A\varepsilon = (P_Aa_j)^T\varepsilon$. Since $\|P_Aa_j\| \le \|a_j\|$, similarly to (35), we obtain

$$\max_{j \in A_k} |w_j(X^TP_A\varepsilon)| \le C\sqrt{\sigma_k|A_k|\log(p)/n}, \qquad 1 \le k \le K,$$

except for a probability at most $2p^{-1}$. Therefore, by the choice of $\lambda_n$, the absolute value of the first term is much smaller than $\omega_n$. So $\max_j |r_j| \le 1$ except for a probability at most $2p^{-1}$, i.e., (a) holds.

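Next, consider (b). Since $|\mu^0_{k+1} - \mu^0_k| \ge 2b_n$, it suffices to show that $\|\hat\mu - \mu^0\|_\infty < b_n$. Note that (43) can be rewritten as

$$\hat\mu - \mu^0 = D^{-1}(n^{-1}D^{-1}X_A^TX_AD^{-1})^{-1}(-\lambda_nD^{-1}d^0 + n^{-1}D^{-1}X_A^T\varepsilon).$$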



It follows from Condition 3.1 that \\fi— A* u || < c 1 1 (A„||D 2 d 



\r>- 1 \\\\n~ 1 x.le\ 



First, 



note that ||D- 2 d°|| 2 < ±Y,k=i\Xtf- Second > from & II d_1x a £ H ^ Cy/nK log(n), 



except a probability of at most n 1 X. Moreover, ||D 1 
together imply 

K 1 

P-M || <CA n (^ — 



minfc |^4.fe|) < 1. These 



fc=l 



1/2 + C l Kl °^ 



k\"> v n 

From the conditions on $b_n$, the right hand side is much smaller than $b_n$. It follows that $\|\hat\mu - \mu^0\|_\infty < b_n$. This proves (b).

In the second part of the proof, we derive the convergence rate of $\|\hat\beta - \beta^0\|$. Note that $\|\hat\beta - \beta^0\| = \|D(\hat\mu - \mu^0)\|$, and from (43),

$$D(\hat\mu - \mu^0) = (n^{-1}D^{-1}X_A^TX_AD^{-1})^{-1}(-\lambda_nD^{-1}d^0 + n^{-1}D^{-1}X_A^T\varepsilon).$$
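Therefore, $\|\hat\beta - \beta^0\| \le c_1^{-1}(\lambda_n\|D^{-1}d^0\| + n^{-1}\|D^{-1}X_A^T\varepsilon\|)$, where $\|D^{-1}d^0\|^2 \le 4\sum_{k=1}^{K} \frac{1}{|A_k|}$ and $\|D^{-1}X_A^T\varepsilon\| = O_p(\sqrt{nK})$ by (25). Combining these gives

$$\|\hat\beta - \beta^0\| = O_p\left( \sqrt{\frac{K}{n}} + \lambda_n\Big( \sum_{k=1}^{K} \frac{1}{|A_k|} \Big)^{1/2} \right).$$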



8.4 Proof of Theorem 3.4

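The order generated by $\hat\beta^{ols}$ preserves the order of $\beta^0$, if and only if, $\beta^0_i < \beta^0_j$ implies $\hat\beta^{ols}_i \le \hat\beta^{ols}_j$ for any pair $1 \le i, j \le p$. Note that when $\beta^0_i < \beta^0_j$, necessarily $\beta^0_j - \beta^0_i \ge 2b_n$. Moreover, $\hat\beta^{ols}_j - \hat\beta^{ols}_i \ge (\beta^0_j - \beta^0_i) - 2\|\hat\beta^{ols} - \beta^0\|_\infty$. So it suffices to show that $\|\hat\beta^{ols} - \beta^0\|_\infty \le b_n$ with probability at least $1 - 2p^{-1}$.

From direct calculations, $\hat\beta^{ols} = \beta^0 + (X^TX)^{-1}X^T\varepsilon$. Let $a_j = X(X^TX)^{-1}e_j$, $j = 1, \dots, p$. Then $\|a_j\|^2 = e_j^T(X^TX)^{-1}e_j \le c_4n^{-1}$. Note that $\hat\beta^{ols}_j - \beta^0_j = a_j^T\varepsilon$. By Condition 3.3 and applying the union bound,

$$P\left( \|\hat\beta^{ols} - \beta^0\|_\infty > \sqrt{(2c_4/c_3)\log(p)/n} \right) \le \sum_{j=1}^{p} P\left( |a_j^T\varepsilon| > \|a_j\|\sqrt{2\log(p)/c_3} \right) \le \sum_{j=1}^{p} 2p^{-2} = 2p^{-1}.$$

So, with probability at least $1 - 2p^{-1}$, $\|\hat\beta^{ols} - \beta^0\|_\infty \le b_n$. This completes the proof. □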
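The deterministic part of this argument (a sup-norm error of at most $b_n$ cannot reverse the order of coefficients that are at least $2b_n$ apart) can be illustrated with a small check. The vectors below are hypothetical, and `preserves_order` simply encodes the definition used at the start of the proof.

```python
def preserves_order(beta0, beta_hat):
    """True if beta0[i] < beta0[j] always implies beta_hat[i] <= beta_hat[j]."""
    p = len(beta0)
    return all(beta_hat[i] <= beta_hat[j]
               for i in range(p) for j in range(p)
               if beta0[i] < beta0[j])


def sup_norm_error(beta0, beta_hat):
    """Largest coordinate-wise absolute error."""
    return max(abs(a - b) for a, b in zip(beta0, beta_hat))


beta0 = [0.0, 0.0, 1.0, 1.0, 2.0]      # distinct values are 2 * b_n apart
b_n = 0.5

close = [0.3, -0.4, 0.6, 1.4, 1.7]     # sup-norm error below b_n
far = [0.6, 0.0, 0.4, 1.0, 2.0]        # sup-norm error above b_n

assert sup_norm_error(beta0, close) <= b_n and preserves_order(beta0, close)
assert sup_norm_error(beta0, far) > b_n and not preserves_order(beta0, far)
```

The second case shows that the condition on the sup-norm error cannot simply be dropped: once the error exceeds $b_n$, two coefficients from different groups may swap ranks.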



8.5 Proof of Proposition 4.1 



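Consider the first claim. Given $k$, let $d_k = \min\{l : V_{kl} \ne \emptyset\}$ and $u_k = \max\{l : V_{kl} \ne \emptyset\}$. Then $A_k = \cup_{l=d_k}^{u_k} V_{kl}$. Moreover, for any $d_k < l < u_k$,

$$\beta^0_{A,k} \le \max_{j \in B_{d_k}} \beta^0_j \le \min_{j \in B_l} \beta^0_j \le \max_{j \in B_l} \beta^0_j \le \min_{j \in B_{u_k}} \beta^0_j \le \beta^0_{A,k},$$

where the first and last inequalities are because $A_k \cap B_{d_k} \ne \emptyset$ and $A_k \cap B_{u_k} \ne \emptyset$, and the inequalities between come from Definition 2.3. It follows that $\beta^0_j = \beta^0_{A,k}$ for all $j \in B_l$. This means $B_l \subset A_k$, and hence $V_{kl} = B_l$.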

Consider the second claim. Given $l$, let $a_l = \min\{k : V_{kl} \ne \emptyset\}$ and $b_l = \max\{k : V_{kl} \ne \emptyset\}$, and so $B_l = \cup_{k=a_l}^{b_l} V_{kl}$. For any $a_l < k < b_l$ and $l' < l$,

$$\max_{j \in B_{l'}} \beta^0_j \le \min_{j \in B_l} \beta^0_j \le \beta^0_{A,a_l} < \beta^0_{A,k},$$

where the first inequality comes from Definition 2.3, the second inequality is because $A_{a_l} \cap B_l \ne \emptyset$ and the last inequality is from the labelling of groups and the fact that $a_l < k$. It follows that $B_{l'} \cap A_k = \emptyset$. Similarly, for any $l' > l$, $B_{l'} \cap A_k = \emptyset$. As a result, $A_k \subset B_l$ and $V_{kl} = A_k$. □



8.6 Proof of Theorem 4.1



Recall the mappings $T$, $T^{-1}$ and $T^*$ defined in the proof of Theorem 3.1. Write $Q_n(\beta) = L_n(\beta) + P_n(\beta)$, where $L_n(\beta) = \frac{1}{2n}\|y - X\beta\|^2$ and $P_n(\beta) = P_{\Upsilon,\lambda_1,\lambda_2}(\beta)$. For any $\mu \in \mathbb{R}^K$, let

$$L_n^A(\mu) = L_n(T^{-1}(\mu)), \qquad P_n^A(\mu) = P_n(T^{-1}(\mu)),$$

and define $Q_n^A(\mu) = L_n^A(\mu) + P_n^A(\mu)$.

We only need to show that $\hat\beta^{oracle}$ is a strictly local minimum of $Q_n$ with probability at least $1 - \epsilon_0 - n^{-1}K - 5p^{-1}$. Let $E_1'$ be the event that $\Upsilon$ preserves the order of $\beta^0$, and define the event $E_2$ and the set $B$ the same as in the proof of Theorem 3.1. For an event $E_3'$ to be defined such that $P((E_3')^c) \le 5p^{-1}$, we shall show that (27) and (28) hold on the event $E_1' \cap E_2 \cap E_3'$. The claim then follows immediately. Similar to the proof of Theorem 3.1, it suffices to show (29) and (30).

Consider (29) first. Recall that $V_{kl} = A_k \cap B_l$. Define $m_{1,kk'} = \sum_{l=1}^{L-1}(|V_{kl}||V_{k'(l+1)}| + |V_{k'l}||V_{k(l+1)}|)$ and $m_{2,kk'} = \sum_{l=1}^{L}|V_{kl}||V_{k'l}|$ for $1 \le k < k' \le K$. Write for short $\rho_1 = \rho_{\lambda_1}$ and $\rho_2 = \rho_{\lambda_2}$. It follows that

$$P_n^A(\mu) = \lambda_1 \sum_{1 \le k < k' \le K} m_{1,kk'}\,\rho_1(|\mu_k - \mu_{k'}|) + \lambda_2 \sum_{1 \le k < k' \le K} m_{2,kk'}\,\rho_2(|\mu_k - \mu_{k'}|).$$

Therefore, it suffices to check $\min_{k \ne k'} |\mu_k - \mu_{k'}| > a\max\{\lambda_{1n}, \lambda_{2n}\}$. Note that the left hand side is lower bounded by $2b_n - 2\|\beta - \beta^0\|_\infty \ge 2b_n - C\sqrt{K\log(n)/n} \ge b_n > a\max\{\lambda_{1n}, \lambda_{2n}\}$ for large $n$, which proves (29).



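Next, consider (30). For $\beta \in B$, write $\mu = T^*(\beta)$, $\beta^* = T^{-1}(\mu)$. By Taylor expansion,

$$Q_n(\beta) - Q_n(\beta^*) = -\frac{1}{n}(y - X\beta^m)^TX(\beta - \beta^*) + \sum_{j=1}^{p} \frac{\partial P_n(\beta^m)}{\partial \beta_j}(\beta_j - \beta^*_j) \equiv K_1 + K_2,$$

where $\beta^m$ is in the line between $\beta$ and $\beta^*$. Let $\bar\rho_i(t) = \rho_i'(|t|)\,\mathrm{sgn}(t)$, $i = 1, 2$. By rearranging terms in $K_2$, we can write

$$\begin{aligned} K_2 &= \lambda_1 \sum_{l=1}^{L-1} \sum_{i \in B_l,\, j \in B_{l+1}} \bar\rho_1(\beta^m_i - \beta^m_j) \left[ (\beta_i - \beta_j) - (\beta^*_i - \beta^*_j) \right] \\ &\quad + \lambda_2 \sum_{l=1}^{L} \sum_{i,j \in B_l} \bar\rho_2(\beta^m_i - \beta^m_j) \left[ (\beta_i - \beta_j) - (\beta^*_i - \beta^*_j) \right]. \end{aligned}$$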
z=i ijeBj 
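For those $i, j$ not belonging to the same true group, $|\beta^m_i - \beta^m_j| \ge 2b_n - 2\|\beta^m - \beta^0\|_\infty \ge 2b_n - 2\|\beta - \beta^0\|_\infty$. Similarly as before, we obtain $\bar\rho_1(\beta^m_i - \beta^m_j) = \bar\rho_2(\beta^m_i - \beta^m_j) = 0$. On the other hand, for those $i, j$ belonging to the same true group, $\beta^*_i = \beta^*_j$ and hence $\mathrm{sgn}(\beta^m_i - \beta^m_j) = \mathrm{sgn}(\beta_i - \beta_j)$. Together, we find that

$$\begin{aligned} K_2 &= \lambda_1 \sum_{l=1}^{L-1} \sum_{i \in B_l,\, j \in B_{l+1},\, i \sim j} \rho_1'(|\beta^m_i - \beta^m_j|)\,|\beta_i - \beta_j| + \lambda_2 \sum_{l=1}^{L} \sum_{i,j \in B_l,\, i \sim j} \rho_2'(|\beta^m_i - \beta^m_j|)\,|\beta_i - \beta_j| \\ &\ge \lambda_1 \sum_{l=1}^{L-1} \sum_{i \in B_l,\, j \in B_{l+1},\, i \sim j} \rho_1'(2t_n)\,|\beta_i - \beta_j| + \lambda_2 \sum_{l=1}^{L} \sum_{i,j \in B_l,\, i \sim j} \rho_2'(2t_n)\,|\beta_i - \beta_j|, \qquad (45) \end{aligned}$$

where $i \sim j$ means $i$ and $j$ are in the same true group, and the last inequality comes from the concavity of $\rho$ and the fact that $|\beta^m_i - \beta^m_j| \le 2\|\beta - \beta^*\| \le 2t_n$.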

Now, we simplify $K_1$. Let $z = z(\beta^m) = X^T(y - X\beta^m)$ and write $K_1 = -\frac{1}{n}z^T(\beta - \beta^*)$. Note that for each $j \in A_k$, $\beta^*_j = \frac{1}{|A_k|}\sum_{i \in A_k} \beta_i = \frac{1}{|A_k|}\sum_{l=d_k}^{u_k}\sum_{i \in V_{kl}} \beta_i$, where $d_k$ and $u_k$ are as in Proposition 4.1. Then



-. K u k 

^ = --EE 



n 

fc=l Z=<4 jGVfei 



fe=i«=d fe jev fe! 1 ft| z'=d fe i'6V fc! , 

-sEmE E E !>-*)<a-/%o 

fc=i 1 fc| i=d k v=d k jev kl j'ev hl , 

K u k 

-^E^E E fc-v)<ft-A0 

fc=l 1 hl l=d k j,j'GV kl 

"^ErH E E 

fe=i 1 fe| d k <i<i'<u k jeVkufe v kl , 

K n + K l2 . 



Using notations in Proposition 4.1, $\sum_{k=1}^{K}\sum_{l=d_k}^{u_k} = \sum_{l=1}^{L}\sum_{k=a_l}^{b_l}$. Therefore,



L b 



/v "u "iEE E \JT\( z i - *i')0*J - Mi') 



2ji^^ ^ I A.i 

z=i k= ai j,j'ev kl 1 fc| 



= "^E E (46) 

/_1 j,j'&Bi,j£j' 

where Ojj/(z) = r^7(% — for G A^. To simplify if 12, note that given any (j, j') 
such that $j \in V_{kl}$ and $j' \in V_{kl'}$, for some $k$ and $l < l'$, we have

1 



A' - ft 



nl=i+i \ v kh\ 



E 



i'-i 



Plugging this into the expression Kn, we obtain 



S(h,h+i,- ,v) : »i=J> »('=/;]. 
i iieVH./i=i+i,-/-i J 



12 



1 * 1 

TTTI 5^ 



E 



( z il z i t i ) 



Eos* - 



n fc=i l Afc l rf fc <J<i'<tt fc {(ij,ij +1 ,...,i,,): i h ey fch } n?i=i+i l^kl h=; 

1 x 1 



lh+1 



d k <l<l'<u k h=l j&V kh ,j'eV k(h+1) 



where for (j, /, /, Z', /i) such that j G V^, / G V fc(/l+1 ) and / < /i < /' - 1, 



Z = h = V - 1 
l = h<l'-\ 



l<h=V -1 



and z w is the average of : j G V^}. By rearranging terms, Ylk=i T,d k <l<l><u k Y?h= 



J2h=i Y?k=a h ^2(l,l'):d k <l<h<l'<u k - Therefore 

, L-l b h 

Kl2 = —EE 



j _ h u k 

h=l k=a h j£V kh ,j'£V k ( h+1) l=d k l'=h+l 



L-l 



n ^ ^ 



Tjj>{x){pj ~ fa), 



h ~ x jeB h ,j'eB h+1 ,j£f 



where 

T jf {z) 



^ h u k 
1 fc| i=d k l'=h+l 



h—1 u k 



+ 



oE E 

i 

1_ ^ \V, r \ 



1*^11*^1 __ x , 1 \Vm\ _ v 

w £ A, I^IIIW* " ^ w 4 1^1 ( * H " Zf) 

u k 



fa - zw) + j^- fa - Zj>) 



h-l 



1 mi\(Evt h+1 \Vkv\) _ 1 J2Z h+1 \v kl >\ 
\M^ k \v kh \\v k(h+1) \ Zkl+ \A k \ \v k(h+l) \ z > 



1 Ufc 

o E 



=/i+2 



l^ft||^fc(h+i)l 



ZkV 



j_zLdjVki\ 

\A k \ \V kh \ 



-Zii 
l^4/i||V fc (fe+i)|
l 



-EE 

l=d k i&V k i 



Zi 



i^i 



E E 



i'=h+nev> 



kl' 



\Vkh\\Vk(h+l)\ 



1^1 



Let Aj^ = Ui< h V k i and = Uj>/ l Vjfcj. Then for any (J J') such that j G / G 
and j, j' G j4fc, we have the following expression 



r,y(z) 



E 



1 

fcfc 



+ 



|Vkh||Vfc(/H-i)l L \A k \ . 



E(%-^) + ^fE( 



eA lh 



Zi Zji t 



(48) 



Combining (45), (46) and (47) gives 



L-l 



Q n (0)-Q n {ir) = E [Aip / 1 (2t n )|A-^|-n- 1 ^(z)(A-^)] 



' 1 i£B u j£B l+1 ,i&j 



+ E E M(2t„)|# - &| - n-HijWiPi - &)] . 



= 1 - »A. 



Therefore, to show (30), we only need to show for sufficiently small $t_n$,

$$n^{-1}\max |\tau_{ij}(z)| < \lambda_1\rho_1'(t_n), \qquad n^{-1}\max |\theta_{ij}(z)| < \lambda_2\rho_2'(t_n). \qquad (49)$$

Note that $z = X^T\varepsilon + \eta + \eta^m$, where $\eta = X^TX(\beta^0 - \beta^*)$ and $\eta^m = X^TX(\beta^* - \beta^m)$. It is seen that $\|\eta^m\| \le \lambda_{\max}(X^TX)\|\beta - \beta^*\| \le \lambda_{\max}(X^TX)\,t_n$. So $\tau_{ij}(z) = \tau_{ij}(X^T\varepsilon + \eta) + rem$, where the remainder term is uniformly bounded by $g(t_n)$ with $g(0) = 0$. Similar situations are observed for $\theta_{ij}(z)$. As a result, to show (49), it suffices to show

$$n^{-1}\max |\theta_{ij}(X^T\varepsilon + \eta)| < \lambda_2\rho_2'(0+), \qquad (50)$$

and

$$n^{-1}\max |\tau_{ij}(X^T\varepsilon + \eta)| < \lambda_1\rho_1'(0+). \qquad (51)$$

First, consider (50). Let $E_{31}'$ be the event



n- 1 max |0„(X T e)| < a/c3 1 log(2p)/(n| A fc |), for all k. 

Note that 0ij(X. T e) = r^r(x, — Xj) T e, where ||xj — x-,|| = \/2n. Moreover, the number 
of such pairs is bounded by \Ak\ 2 /2 < p 2 /2. Applying Condition 3.3 and the union 
bound, we see that $P((E_{31}')^c) \le p^{-1}$. Moreover, $|\theta_{ij}(\eta)| \le \frac{2}{|A_k|}\max_{i \in V_{kh}} |\eta_i - \bar\eta_{kh}|$, where $\bar\eta_{kh}$ is the average of $\{\eta_i : i \in V_{kh}\}$. Note that $\max_{i \in V_{kh}} |\eta_i - \bar\eta_{kh}| \le n\nu_k\|\beta^* - \beta^0\|$ and $\|\beta^* - \beta^0\| \le \|\beta - \beta^0\|$ because $\beta^*$ is the orthogonal projection of $\beta$ onto $M_A$. Noticing that $\beta \in B$, we obtain

$$n^{-1}\max |\theta_{ij}(\eta)| \le C\nu_k|A_k|^{-1}\sqrt{K\log(n)/n}.$$

Combining the above results with the choice of $\lambda_2$ yields $n^{-1}\max_{ij} |\theta_{ij}(z)| \ll \lambda_2$, and (50) follows.

1 



Next, consider (51 ). In (48), the first term can be written as 



|Vfch||V fe ( ft+1 )| 



w kh (z), where 



w k h(z) has a similar form to that of Wj{z) in (33). Let E' 32 be the event that 



n 1 mEK w )!fl X J e + ?) < C \ Vv k \A k \\ 

k,h \ V n V n 



(52) 



It is easy to see that we can follow the steps of proving (35 ) and (37) to show P((E' 23 ) C ) < 

1 1^ - j 1 [w^, (X T e) + 

\Vkh\\Vk(h+l)\ L JJ v ' 



2p 1 , Write the second term in (48) as , T , MT ; rio,-,v(z) 

1 r |Vfefc||V%(h+i)| 



u)jj/(?7)]. First, let £"33 be the event that 



n 



1 max \ wjji(X T e)\ < C\/a k \A k \ log(p)/n. 



.1:1 



(53) 



We observe that n 1 Wjj>(X T e) = —aj^e, where 



\ A kh\ 
\A k \ 



X a?, (!aL + e i 



I4 2 I 



So ||a„v|| 2 < 2n-V fc [ r 0(L 2 + 1) + + *)] < n _ V fc (|A fc | + 4), where L x = |A* 



and L2 = \Al h \. Similar to (36), we can show P((E , 33 ) C ) < 2p . Second, note that 



Mii'fa)! < ( Z,1 j^|" 1) + ) m axiev fc h l^/t - %hl < C|^ fc |max ie v feh \m ~ Vkhl where 

max ie y^ |r/j - fj kh \ < nv k \\(3* - /3°|| < Cv k ^JnK log(ra). As a result, 



1 max I Wjf(r]) \ < Cv k \ A k \ K log(ra) /n. 



(54) 



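Let $E_3' = E_{31}' \cap E_{32}' \cap E_{33}'$, where $P((E_3')^c) \le 5p^{-1}$. Combining (52)-(54) gives

$$n^{-1}\max |\tau_{jj'}(X^T\varepsilon + \eta)| \le C\max_k\left\{ \sqrt{\frac{\sigma_k|A_k|\log(p)}{n}} + \nu_k|A_k|\sqrt{\frac{K\log(n)}{n}} \right\}$$

over the event $E_1' \cap E_2 \cap E_3'$. By choice of $\lambda_1$, the right hand side is much smaller than $\lambda_1$. This proves (51). □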
8.7 Proof of Theorem 4.2

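Write $\mu^0 = T(\beta^0)$ and $\hat\mu = T(\hat\beta) = \hat\beta_A$. Let $Q_n^A(\mu) = L_n^A(\mu) + P_n^A(\mu)$ be as in the proof of Theorem 4.1. Denote

$$B_n^0 = \{\mu \in \mathbb{R}^K : \|D(\mu - \mu^0)\| \le \sqrt{K\log(n)/n}\}$$

be a neighbourhood of $\mu^0$. We have seen: (i) $\hat\mu \in B_n^0$ with probability tending to 1; (ii) $P_n^A(\mu) = 0$ for $\mu \in B_n^0$; (iii) $\hat\mu$ is a strictly local minimum of $Q_n^A$ in $B_n^0$. Combining the above, we find that

$$\frac{\partial L_n^A}{\partial\mu}(\hat\mu) = -\frac{1}{n}X_A^T(y - X_A\hat\mu) = 0.$$

It follows from $y = X_A\mu^0 + \varepsilon$ that

$$(X_A^TX_A)^{1/2}(\hat\mu - \mu^0) = (X_A^TX_A)^{-1/2}X_A^T\varepsilon.$$

Therefore, to show the claim, it suffices to show $B_n(X_A^TX_A)^{-1/2}X_A^T\varepsilon \xrightarrow{d} N(0, H)$, i.e., for any $a \in \mathbb{R}^q$,

$$a^TB_n(X_A^TX_A)^{-1/2}X_A^T\varepsilon \xrightarrow{d} N(0, a^THa). \qquad (55)$$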
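Let $v = X_A(X_A^TX_A)^{-1/2}B_n^Ta$, and write the left hand side of (55) as $v^T\varepsilon = \sum_{i=1}^{n} v_i\varepsilon_i$. The $v_i\varepsilon_i$'s are independently distributed with $E[v_i\varepsilon_i] = 0$ and $E[|v_i\varepsilon_i|^2] = v_i^2$. Let $s_n^2 = \sum_{i=1}^{n} E[|v_i\varepsilon_i|^2]$. By Lindeberg's central limit theorem, if for any $\epsilon > 0$,

$$\lim_{n \to \infty} s_n^{-2}\sum_{i=1}^{n} E\left[ |v_i\varepsilon_i|^2\,1\{|v_i\varepsilon_i| > \epsilon s_n\} \right] = 0, \qquad (56)$$

then $s_n^{-1}\sum_{i=1}^{n} v_i\varepsilon_i \xrightarrow{d} N(0, 1)$. Since $s_n^2 = a^TB_nB_n^Ta \to a^THa$, (55) follows immediately from Slutsky's lemma.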



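It remains to show (56). Using the formula $E[X1\{X > \epsilon\}] = \epsilon P(X > \epsilon) + \int_\epsilon^\infty P(X > u)\,du$ for $X = |v_i\varepsilon_i|^2$, we have

$$E\left[ |v_i\varepsilon_i|^2\,1\{|v_i\varepsilon_i| > \epsilon s_n\} \right] = \epsilon^2s_n^2\,P(|v_i\varepsilon_i| > \epsilon s_n) + \int_{\epsilon^2s_n^2}^{\infty} P(|v_i\varepsilon_i| > \sqrt{u})\,du.$$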
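The truncated-expectation formula used above can be sanity-checked numerically. The sketch below takes $X$ to be a standard exponential variable (an arbitrary choice made only for illustration, since then $P(X > u) = e^{-u}$ in closed form) and compares a midpoint-rule approximation of $E[X1\{X > \epsilon\}]$ with $\epsilon P(X > \epsilon) + \int_\epsilon^\infty P(X > u)\,du$.

```python
import math


def truncated_mean_numeric(eps, upper=50.0, steps=200_000):
    """Midpoint-rule approximation of E[X 1{X > eps}] for X ~ Exp(1)."""
    h = (upper - eps) / steps
    total = 0.0
    for i in range(steps):
        x = eps + (i + 0.5) * h
        total += x * math.exp(-x)          # x times the Exp(1) density
    return total * h


def truncated_mean_formula(eps):
    """eps * P(X > eps) + integral over (eps, inf) of P(X > u) du, for X ~ Exp(1)."""
    tail = math.exp(-eps)                  # P(X > eps)
    tail_integral = math.exp(-eps)         # closed form of the tail integral
    return eps * tail + tail_integral


eps = 1.5
numeric = truncated_mean_numeric(eps)
formula = truncated_mean_formula(eps)
assert abs(numeric - formula) < 1e-6
```

The identity holds for any nonnegative random variable; the exponential case is used only because both sides have closed forms.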

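From Condition 3.3, $P(|v_i\varepsilon_i| > \epsilon s_n) \le 2\exp(-c_3\epsilon^2s_n^2/|v_i|^2)$ and $\int_{\epsilon^2s_n^2}^{\infty} P(|v_i\varepsilon_i| > \sqrt{u})\,du \le \int_{\epsilon^2s_n^2}^{\infty} 2\exp(-c_3u/|v_i|^2)\,du = (2|v_i|^2/c_3)\exp(-c_3\epsilon^2s_n^2/|v_i|^2)$. Note that $\exp(-x) \le k!\,x^{-k}$ for any $x > 0$ and positive integer $k$. It follows that

$$s_n^{-2}\sum_{i=1}^{n} E\left[ |v_i\varepsilon_i|^2\,1\{|v_i\varepsilon_i| > \epsilon s_n\} \right] \le \frac{C}{\epsilon^2s_n^4}\sum_{i=1}^{n} |v_i|^4 \le C\max_i |v_i|^2,$$

where $C$ is a generic constant that may depend on $\epsilon$, and in the last inequality we have used the facts that $s_n^2 = \sum_{i=1}^{n} |v_i|^2$ and $s_n^{-1} = O(1)$.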
Note that llvHoo < ||X A (X^X yl )- 1 || 00 ||B^| 2i00 ||a|| = o(l). □ 
8.8 Proof of Corollary 4.1



It is easy to see that the asymptotic variance of $a_n^T(\hat\beta^{ols} - \beta^0)$ is $a_n^T(X^TX)^{-1}a_n = v_{1n}$. To compute the asymptotic variance of $a_n^T(\hat\beta - \beta^0)$, note that

$$a_n^T(\hat\beta - \beta^0) = a_n^TM_nD(X_A^TX_A)^{-1/2}(X_A^TX_A)^{1/2}(\hat\beta_A - \beta^0_A),$$

where $D = \mathrm{diag}(|A_1|^{1/2}, \dots, |A_K|^{1/2})$. Applying Theorem 4.2, the asymptotic variance is

$$a_n^TM_nD(X_A^TX_A)^{-1}DM_n^Ta_n.$$

Since $X_A = XM_nD$, the above quantity is equal to $a_n^TM_n(M_n^TX^TXM_n)^{-1}M_n^Ta_n = v_{2n}$.

Next, we show $v_{1n} \ge v_{2n}$. There exists an orthogonal matrix $Q$ such that $M_n$ is equal to the first $K$ columns of $Q$. Write $b = Q^Ta_n$ and $G = Q^TX^TXQ$. Direct calculations yield $v_{1n} = b^TG^{-1}b$ and $v_{2n} = b_1^TG_{11}^{-1}b_1$, where $b_1$ is the subvector of $b$ formed by its first $K$ elements and $G_{11}$ is the upper left $K \times K$ block of $G$. From basic algebra, $v_{1n} \ge v_{2n}$. □
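The comparison $v_{1n} \ge v_{2n}$ can be checked on a toy design. The sketch below is only an illustration: the design matrix `X`, the grouping, and the vector `a` are made up, $M_n$ is taken to have columns equal to normalized group indicators (so that $X_A = XM_nD$ as in the proof), and a small Gauss-Jordan solver stands in for matrix inversion to keep the example free of external libraries.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def quad_inv(g, b):
    """Return b^T g^{-1} b."""
    return sum(bi * xi for bi, xi in zip(b, solve(g, b)))


X = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1],
     [1, 3, 1, 0], [0, 2, 2, 1], [3, 1, 1, 2]]   # hypothetical 6 x 4 design
groups = [[0, 1], [2, 3]]                         # hypothetical true groups
a = [1.0, -1.0, 0.5, 2.0]

# Columns of M are normalized group indicators, so M^T M = I.
M = [[len(g) ** -0.5 if j in g else 0.0 for g in groups] for j in range(4)]

gram = matmul(transpose(X), X)                    # X^T X
v1 = quad_inv(gram, a)                            # a^T (X^T X)^{-1} a

g11 = matmul(matmul(transpose(M), gram), M)       # M^T X^T X M
b1 = [sum(M[j][k] * a[j] for j in range(4)) for k in range(len(groups))]
v2 = quad_inv(g11, b1)                            # a^T M (M^T X^T X M)^{-1} M^T a

assert v1 >= v2 >= 0.0
```

The inequality reflects the "basic algebra" step of the proof: for a positive definite $G$, $b^TG^{-1}b \ge b_1^TG_{11}^{-1}b_1$ for every $b$.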

8.9 Proof of Theorem 5.1

It suffices to show that $\hat\beta^{oracle}$ is a strictly local minimum of $Q_n^{sparse}$ with probability at least $1 - \epsilon_0 - n^{-1}K - 2p^{-1}$. First, there exists an event $E_1$ such that $P(E_1^c) \le \epsilon_0$ and $\Upsilon$ preserves the order of $\beta^0$ over the event $E_1$. Second, for a sufficiently large constant $C$, define $B$ as the set of all $\beta$ such that $\|\beta - \beta^0\| \le C\sqrt{K\log(n)/n}$. By recalling the proof of Theorem 3.1, we see that there exists an event $E_2$ such that $P(E_2^c) \le n^{-1}K$ and $\hat\beta^{oracle} \in B$ over the event $E_2$. Third, for any $\beta \in B$, let $\beta_S$ be the vector such that $\beta_{S,j} = \beta_j1\{j \in S\}$, where $S$ is the support of $\beta^0$; and let $\beta^*_S$ be the orthogonal projection of $\beta_S$ onto $M_A$. We aim to show there exists an event $E_3$ such that $P(E_3^c) \le 2p^{-1}$ and over the event $E_1 \cap E_2 \cap E_3$:

$$Q_n^{sparse}(\beta^*_S) \ge Q_n^{sparse}(\hat\beta^{oracle}), \quad \text{for any } \beta \in B, \qquad (57)$$

and the inequality is strict whenever $\beta^*_S \ne \hat\beta^{oracle}$; for a positive sequence $\{t_n\}$,

$$Q_n^{sparse}(\beta_S) \ge Q_n^{sparse}(\beta^*_S), \quad \text{for any } \beta \in B \text{ and } \|\beta_S - \hat\beta^{oracle}\| \le t_n, \qquad (58)$$

and the inequality is strict whenever $\beta_S \ne \beta^*_S$; for a positive sequence $\{t_n'\}$,

$$Q_n^{sparse}(\beta) \ge Q_n^{sparse}(\beta_S), \quad \text{for any } \beta \in B \text{ and } \|\beta - \hat\beta^{oracle}\| \le t_n', \qquad (59)$$

and the inequality is strict whenever $\beta \ne \beta_S$.



Suppose (57)-(59) hold. Consider the neighborhood of $\hat\beta^{oracle}$ defined as $B_n = \{\beta \in B : \|\beta - \hat\beta^{oracle}\| \le \min\{t_n, t_n'\}\}$. It is easy to see that $\|\beta - \hat\beta^{oracle}\| \le t_n'$ and $\|\beta_S - \hat\beta^{oracle}\| \le \|\beta - \hat\beta^{oracle}\| \le t_n$ for any $\beta \in B_n$. As a result, $Q_n^{sparse}(\beta) \ge Q_n^{sparse}(\hat\beta^{oracle})$ for $\beta \in B_n$, and the inequality is strict except that $\beta = \beta_S = \beta^*_S = \hat\beta^{oracle}$. It follows that $\hat\beta^{oracle}$ is a strictly local minimum of $Q_n^{sparse}$.



Now, we show (57)-(59). The proofs of (57) and (58) are exactly the same as those of (27) and (28), by noting that $Q_n^{sparse}(\beta) = Q_n(\beta)$ for any $\beta$ whose support is contained in $S$. To show (59), write

QT rse {P) - Qn parse (Ps) = --(y - x/3 m ) T x(/3 - (3 S ) + \J2p(PT)^ 

where $\beta^m$ lies in the line between $\beta$ and $\beta_S$. First, note that $\mathrm{sgn}(\beta^m_j) = \mathrm{sgn}(\beta_j)$ for $j \notin S$. Second, $\|\beta^m - \beta_S\| \le \|\beta - \beta_S\| \le t_n'$. Hence, for $j \notin S$, $|\beta^m_j| \le t_n'$. By the concavity of $\rho$, $\rho'(|\beta^m_j|) \ge \rho'(t_n')$. Third, write $z = X^T(y - X\beta^m) = X^T\varepsilon + \eta + \eta^m$, where $\eta = X^TX(\beta^0 - \beta_S)$ and $\eta^m = X^TX(\beta_S - \beta^m)$. Combining the above,

Qr rse ((3)-Q s n parse (Ps) > Etvco-^S-Pii > E w$+)-\\& T eu-9nQ)] i/h 

where g n (t' n ) = A[Ap'(0+) - \p'(2t n )] + n" 1 ^™ satisfying ^(0) = 0. First, from 



Condition 3.1, ||-X £||oo < y c 3 log(2p)/n, except for a probability of 2p . Since 



X n 3> y / log(p)/n, when n is sufficiently large, Cy4og(p)/n < X n /2p' (0+) = X n /2. Sec- 
ond, since g n (0) = 0, we can always choose t' n sufficiently small so that g n (t) < A n /4 for 
any < t < t' n . Combining the above gives 

Q sparse (f3) _ Qsparse^^ > £ ^|^.|_ 



Then (59) follows immediately. The proof is now complete. □



References 

H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and 
supervised clustering of predictors with oscar. Biometrics, 64:115-123, 2008. 

D. A. Darling and P. Erdős. A limit theorem for the maximum of normalized sums of
independent random variables. Duke Math. J., 23:143-155, 1956.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle 
properties. Journal of American Statistical Association, 96:1348-1360, 2001. 

J. Fan and J. Lv. Sure independence screening for ultra-high dimensional feature space. 
Journal of Royal Statistical Society B, 70:849-911, 2008. 

J. Fan and J. Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE
Transactions on Information Theory, 57:5467-5484, 2011.

J. Fan, J. Lv, and L. Qi. Sparse high-dimensional models in economics. Annual Review 
of Economics, 3:291-317, 2011. 

J. Fan, L. Xue, and H. Zou. Strong oracle optimality of folded concave penalized estima- 
tion. Manuscript, 2012. 

A. Fred and A. K. Jain. Robust data clustering. Proceedings of IEEE Computer Society 
Conference on Computer Vision and Pattern Recognition, 3:128-136, 2003. 

J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization.
Ann. Appl. Stat., 1:302-332, 2007.

M. Legendre and D. Gautheret. Sequence determinants in human polyadenylation site 
selection. BMC genomics, 4, 2003. 

H. Liu, H. Han, J. Li, and L. Wong. An in-silico method for prediction of polyadenylation 
signals in human sequences. Genome Inform Ser Workshop Genome Inform, 14:84-93, 
2003. 

X. Shen and H.-C. Huang. Grouping pursuit through a regularization solution surface. 
J. Amer. Statist. Assoc., 105(490):727-739, 2010. 

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness
via the fused lasso. J. Roy. Statist. Soc. B, 67:91-108, 2005.

H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models 
(with discussion). Ann. Statist, 36:1509-1566, 2008. 






